# Index

Page references in bold represent figures and tables.

## Numbers

- 2:1 cache rule of thumb, C-28
- 3ASC Purple pSeries 575, E-20, E-44, E-56
- 80x86 processors. See Intel 80x86
- 99.999% (five nines) claims, 399

## A

- ABC (Atanasoff Berry Computer), K-5
- ABI (application binary interface), B-20
- absolute addressing mode, B-9
- absolute value function, G-24
- Accelerated Strategic Computing Initiative (ASCI), E-20, E-44, E-56
- access bits, C-50
- access time, 310, F-15 to F-16
- access time gap, in disks and DRAM, 359, 359
- accumulator architecture, B-3, B-4
- acknowledgments, 217, H-39 to H-41
- ACS, K-20 to K-21
- Ada, integer division and remainder in, I-12
- adaptive routing, E-47, E-53, E-54, E-73, E-93 to E-94
- adders
  - carry-lookahead, 38, I-37 to I-41, I-38, I-40, I-41, I-42, I-44
  - carry-propagate, I-48
  - carry-skip, I-41 to I-43, I-42, I-44
  - faster division with one, I-54 to I-58, I-55, I-56, I-57
  - faster multiplication with many, I-50 to I-54, I-50 to I-54
  - faster multiplication with single, I-47 to I-50, I-48, I-49
  - ripple-carry, I-2 to I-3, I-3, I-42, I-44
- addition. See also adders
  - denormalized numbers, I-26 to I-27
  - floating-point, I-21 to I-27, I-24, I-36
  - multiple-precision, I-13
  - speeding up, I-25 to I-26, I-37 to I-44
- address aliasing prediction, 130
- address faults, C-40
- address mapping
  - in AMD Opteron, C-12 to C-13, C-53 to C-54, C-54
  - in multibanked caches, 299, 299
  - page coloring, C-37
  - in trace caches, 296, 309
- address size, importance of, C-56
- address space, shared, 202
- address specifiers, B-21, J-68
- address translations (memory mapping). See also translation lookaside buffers
  - in AMD Opteron, C-44 to C-45, C-45
  - during cache indexing, 291–292, C-36 to C-38, C-37, C-39
  - caches dedicated to, C-43
  - in virtual memory, C-40, C-44 to C-47, C-45, C-47
- addressing modes
  - in embedded architectures, J-5 to J-6, J-6
  - in Intel 80x86, J-47 to J-49, J-50, J-59 to J-62, J-59 to J-62
  - in MIPS data transfers, B-34 to B-35
  - paged vs. segmented, C-3, C-41, C-42
  - real, J-45, J-50
  - in RISC desktop architectures, J-5 to J-6, J-5
  - types of, B-9 to B-10, B-9, B-11, B-12, B-13
  - in VAX, J-67, J-70 to J-71
  - in vector computers, B-31 to B-32
- advanced load address table (ALAT), F-27, G-40
- advanced mobile phone service (AMPS), D-25
- Advanced Research Project Agency (ARPA), E-97, E-99
- Advanced RISC Machines. See ARM
- Advanced RISC Machines Thumb. See ARM Thumb
- Advanced Switching Interconnect (ASI), E-103
- affine array indexes, G-6
- age-based, E-49
- aggregate bandwidth, E-13, E-17, E-24
- Aiken, Howard, K-3 to K-4
- ALAT table, F-27, G-40
- alias prediction. See address aliasing prediction
- aliases, C-36
- Alles, A., E-98
- Alliant computers, vector processors in, F-49
- Alpha
  - addressing modes in, J-5 to J-6, J-5
  - architecture overview, **J-4**
  - common MIPS extensions in, J-19 to J-24, **J-21 to J-23**
  - conditional branch options in, B-19
  - instructions unique to, J-27 to J-29
  - MIPS core subset in, J-6 to J-16, J-7, J-9 to J-13, J-17
  - page size changes in, C-56 to C-57
- Alpha 21064, A-43
- Alpha 21164, 220
- Alpha 21264, 88–89, 140
- Alpha MAX, **D-11, J-16 to J-19, J-18**
- AlphaServer 4100, 220–221, **221**
- AltaVista benchmark, 221, 221, 222
- ALU instructions
  - in media extensions, D-10 to D-11
  - memory addressing and, B-12
  - in MIPS architecture, B-37, B-37
  - operand format in, B-5, B-6
  - overview of, A-4 to A-5
  - in unpipelined MIPS implementation, A-27 to A-28, A-30
- AMAT. See average memory access time
- AMD 64, J-46
- AMD Athlon 64 processor
  - L1 cache size in, 294
  - performance of, G-43
  - SMT performance of, 179–181, 179, 180, 181
- AMD ElanSC520, D-13, **D-13**
- AMD K6, 294, D-13, **D-13**
- AMD Opteron processor
  - 64-bit memory management, C-53 to C-55, **C-54, C-55**
  - antialiasing in, C-36
  - cache-coherent multiprocessors, 215
  - data cache organization, C-12 to C-14, **C-13, C-15**
  - dies, 22, 23
  - interconnection networks in, 216–217, **216**
  - L1 cache size in, 294
  - memory hierarchy performance, 331–335, **332, 333, 334**
  - memory hierarchy structure, 326–331, **327, 328, 341**
  - multibanked caches in, 299, 309
  - multilevel exclusion in, C-34
  - organization of, 326, 327
  - Pentium 4 compared to, 136–138, 137, 138
  - performance on SPEC benchmarks, 35, 37, 255–257, **255, 256, 257**
  - translation lookaside buffer organization, C-44 to C-45, C-45
- AMD Pacifica, 320, 339
- Amdahl's Law
  - law of diminishing returns and, 40
  - limited available parallelism and, 202–203
  - in multiple-issue processor performance, 184
  - parallel computers and, 258–259
  - pitfalls in using, 48
  - resource allocation and, 40
  - speedup determination, 39–40
- America processor, K-21 to K-22
- AMPS (advanced mobile phone service), D-25
- Anderson, S. F., I-63
- Andreessen, Marc, E-98
- Andrew benchmark, 225
- annulling, G-26
- antialiasing, C-36 to C-37
- antidependences, 70, G-7 to G-8, K-23
- Apple Macintosh, memory addressing in, K-53
- application binary interface (ABI), B-20
- applied load, E-53
- arbitration
  - arbitration algorithm, E-49 to E-50, **E-49**, E-52
  - overview of, E-21 to E-22
  - in shared-media networks, E-23
  - in switch microarchitecture, E-57 to E-58, E-60 to E-61, E-62
  - in switched-media networks, E-24
  - techniques for, E-49, E-49
- arbitration algorithm, E-49, E-52
- areal density, 358
- arithmetic. See computer arithmetic
- arithmetic mean, 36
- ARM
  - addressing modes in, J-5 to J-6, J-6
  - architecture overview, J-4
  - common extensions in, J-19 to J-24, **J-23, J-24**
  - conditional branch options in, B-19, J-17
  - instructions unique to, J-36 to J-37
  - MIPS core subset in, J-6 to J-16, J-8, J-9, J-14 to J-17
  - multiply-accumulate in, J-19, J-20
- ARM Thumb
  - addressing modes in, J-5 to J-6, J-6
  - architecture overview, J-4
  - common extensions in, J-19 to J-24, **J-23, J-24**
  - instructions unique to, J-37 to J-38
  - MIPS core subset in, J-6 to J-16, J-8, J-9, J-14 to J-17
  - multiply-accumulate in, J-19, J-20
  - reduced code size in, B-23
- ARPA (Advanced Research Project Agency), E-97, E-99
- ARPANET, E-97 to E-98
- arrays, age of access to, 304, 304
- array indexes, G-6
- array multipliers, I-50 to I-54, I-50 to I-54
- array processors, K-36
- ASCI (Accelerated Strategic Computing Initiative), E-20, E-44, E-56
- ASCI Red Paragon, E-20, E-44, E-56
- ASCI White SP Power3, E-20, E-44, E-56
- ASI (Advanced Switching Interconnect), E-103
- associativity
  - access times and, 294–295, **294**
  - cache indexes and, 38, 291, C-38
  - cache size and, 292
  - miss rates and, 291, C-28 to C-29, C-29, C-39
  - in multilevel caches, C-33 to C-34
  - in virtual memory, C-42
- Astronautics ZS-1, K-22
- asynchronous events, A-40, A-41, A-42
- asynchronous I/O, 391
- asynchronous transfer mode (ATM)
  - development of, E-98, E-99
  - Ethernet compared with, E-89, E-90
  - packet format for, E-75
  - as telecommunications standard, E-79
  - virtual output queues and, E-60
  - as wide area network, E-4
- ATA disks, 360–361, **361**, 365
- Atanasoff, John, K-5
- Atanasoff Berry Computer (ABC), K-5
- Athlon 64 processor. See AMD Athlon 64 processor
- Atlas computer, K-52
- ATM. See asynchronous transfer mode
- atomic exchange synchronization primitives, 238–240
- atomic operations, 214
- "atomic swap," J-20, **J-21**
- attributes field, C-50 to C-51, C-51
- autodecrement addressing mode, B-9
- autoincrement addressing mode, **B-9**
- availability claims, 399–400, **400**. See also reliability
- average memory access time (AMAT)
  - associativity and, C-28 to C-29, C-29
  - block size and, C-26 to C-28, C-26, C-27
  - cache size and, 295
  - formula for, 290, C-15, **C-21**
  - in multilevel caches, C-30 to C-31
  - in out-of-order computers, C-19 to C-21, C-21
  - processor performance and, C-17 to C-19
  - in split vs. unified caches, C-15 to C-17
  - for two-level caches, C-30
- average memory stall times, 298, C-56, **C-57**
- average reception factor, E-26, E-32
- average residual service time, 384

## B

- back substitution, G-10
- backpressure, E-65
- Baer, J.-L., K-54
- bandwidth
  - aggregate, E-13, E-17, E-24
  - bisection, E-39, E-41, E-42, E-55, E-89
  - communication, H-3
  - defined, 15, C-2, E-13
  - distributed memory and, 230
  - DRAM improvements, 313
  - in floating-point computations, I-62
  - full bisection, E-39, E-41
  - high-memory, 337–338, **339**
  - improvements in, 15, **16**
  - injection, E-18, E-26, E-41, E-55, E-63
  - integrated instruction fetch units and, 126–127
  - I/O, 371
  - link injection, E-17
  - link reception, E-17
  - main memory, 310
  - multibanked caches and, 298–299, **299, 309**
  - in multiple processors, 205
  - network performance and, E-16 to E-19, **E-19**, E-25 to E-29, E-28, E-89, E-90
  - network switching and, E-50, E-52
  - network topology and, E-41
  - nonblocking caches and, 296–298, **297, 309**
  - overestimating, 336, 338
  - packet discarding and, E-65
  - pipelined cache access and, 296, **309**, A-7
  - reception, E-18, E-26, E-41, E-55, E-63, E-89
  - return address predictors and, 125, 126
  - vector performance and, F-45
- bandwidth gap, in disks and DRAM, 359
- bandwidth matching, E-112
- Banerjee, U., F-51, K-23
- Banerjee tests, K-23
- bank busy time, F-15, F-16
- Barnes-Hut algorithm
  - characteristics of, H-8 to H-9, H-11
  - in distributed-memory multiprocessors, H-28 to H-32
  - in symmetric shared-memory multiprocessors, H-21 to H-26, H-23 to H-26
- barrier networks, H-42
- barrier synchronization, H-13 to H-16, H-14, H-15, H-16
- base fields, C-50
- base registers, A-4
- base station architectures, D-22
- based plus scaled index mode, J-58
- basic blocks, 67
- Baskett, F., F-48
- Bell, G., C-56, K-14, K-37 to K-39, K-52
- benchmark suites, 30
- benchmarks, 29–33. See also SPEC
  - AltaVista, 221, 221, 222
  - Andrew, 225
  - changes over time, 50
  - compiler flags for, 29
  - of dependability, 377–379, **378**
  - desktop, 30–32
  - EEMBC, 30, D-12 to D-13, **D-12**, D-13, D-14
  - for embedded systems, D-12 to D-13, **D-12, D-13, D-14**
  - evolution of SPEC, 30–32, **31**
  - historical development of, K-6 to K-7
  - Linpack, F-8 to F-9, F-37 to F-38
  - NAS parallel, F-51
  - Perfect Club, F-51
  - reports of, 33
  - reproducibility of, 33
  - for servers, 32–33
  - SFS, 376
  - source code modifications and, 29–30
  - suites, 30
  - summarizing results from, 33–37, 35
  - synthetic, 29
  - Transaction Processing Council, 32, 374–375, **375**
  - Web server, 32–33
- Benes topology, E-33, E-33
- Beowulf project, K-42
- BER (bit error rate), D-21 to D-22
- Berkeley RISC processor, K-12 to K-13
- Berkeley's Tertiary Disk project, 368, **369**, 399, **399**
- Berners-Lee, Tim, E-98
- between vs. within instructions, A-41, A-42
- biased exponents, I-15 to I-16
- biased system, for signed numbers, I-7
- bidirectional multistage, E-33
- Big Endian byte order, B-7, B-34
- Bigelow, Julian, K-3
- BINAC, K-5
- binary tree multipliers, I-53 to I-54
- binary-coded decimal formats, B-14
- binary-to-decimal conversion, I-34
- Birman, M. A., I-58
- bisection bandwidth, E-39, E-41, E-42, E-55, E-89
- bisection traffic fraction, E-41 to E-42, E-55
- bit error rate (BER), D-21 to D-22
- bit selection, C-7
- bit vectors, 232
- bits
  - access, C-50
  - dirty, C-10, C-44
  - NaT (Not a Thing), G-38, G-40
  - poison, G-28, G-30 to G-32
  - present, C-50
  - sticky, I-18
  - use, C-43 to C-44
  - valid, C-8
- block addressing, 299, **299**, C-8 to C-9, C-8
- block multithreading, K-26
- block offsets, C-8, C-8
- block servers, 390–391
- block size
  - miss rates and, 252, **252**, 291, C-25 to C-28, **C-26, C-27**, C-39
  - multilevel inclusion and, 248–249
  - multiprogrammed cache misses and, 227–230, **228, 229**
  - in shared-memory multiprocessors, H-27 to H-29, **H-29, H-31**
  - SMP cache misses and, 223–224
- block transfer engines (BLT), E-87, E-87
- blocked floating point, D-6
- blocking
  - in centralized switched networks, E-32
  - network topology and, E-41
  - to reduce miss rate, 303–305, **304**
  - in switching, E-51
- blocking factors, 303
- blocks
  - defined, C-2
  - destination, H-8
  - in directory-based protocols, 234–237, **235**
  - dirty vs. clean, C-10
  - exclusive, 210–211, **214, 215**
  - invalid, 211–212, **213, 214, 215**
  - modified, 211, **213**, 231
  - owners of, 211, 231, **235**
  - placement in main memory, C-42
  - replacement, C-9, C-10, C-14, C-40, C-43 to C-44
  - set-associative placement of, 38, 289
  - shared, 211, **213, 214, 215**, 231
  - state transition diagrams, 234–236, **235, 236**
  - uncached, 231
  - unmodified, 214
  - victim, 301
  - write invalidate protocols and, 211
- Blue Gene/L. See IBM Blue Gene/L
- BOMB, K-4
- Booth recoding, I-8 to I-9, **I-9**, I-48 to I-49, **I-49**
- Bouknight, W. Jack, 195, K-36
- bounds checking, C-50, C-51
- branch costs, 80–89, **81, 84, 85, 87, 88**
- branch delay, A-35 to A-37, A-60, A-60, A-65. See also delayed branch schemes
- branch delay slot, A-23 to A-25, **A-23**, A-24
- branch displacements, J-39
- branch folding, 125
- branch hazards, A-21 to A-26. See also control hazards
  - instruction fetch cycle and, A-21, A-21
  - in MIPS pipeline, A-35 to A-37, A-38
  - in MIPS R4000 pipeline, A-64, A-65
  - performance of branch schemes, A-25 to A-26, **A-26**
  - reducing penalties, A-21, A-22 to A-25, A-22, A-23, A-24
  - restarting, A-42 to A-43
- branch history tables, 82–86, **83, 84, 85**
- branch prediction. See hardware branch prediction; static branch prediction
- branch registers, J-32 to J-33
- branches
  - branch target distances, B-18, B-18
  - conditional branch operations, B-19, **B-19, B-20**
  - in control flow instructions, B-16, **B-17**, B-18, B-37 to B-38, B-38
  - history tables, 82–86, **83, 84, 85**
  - in IBM 360, J-86 to J-87, **J-86**, J-87
  - penalties, A-36 to A-37, **A-39**
  - registers, J-32 to J-33
  - in RISC architecture, A-5
  - straightening, 302
  - vectored, J-35
- branch-prediction buffers, 82–86, **83**, 84, 85
- branch-target buffers (BTB), 122–125, 122, 124
- branch-target calculations, A-35, A-38, A-39
- breakpoints, A-40, A-42
- Brent, R. P., I-63
- bridges, E-78
- bristling, E-38, E-92
- broadcasting, E-24, H-35 to H-36
- Brooks, F. P., Jr., C-49
- BTB (branch-target buffers), 122–125, 122, 124
- bubble flow control, E-53, E-73
- bubbles, pipeline, A-13, A-20, E-47. See also pipeline stalls
- buckets, in histograms, 382
- buffered crossbars, E-62
- buffered wormhole switching, E-51
- buffers. See also translation lookaside buffers
  - branch-prediction, 82–86, **83, 84, 85**
  - branch-target, 122–125, 122, 124
  - buffered crossbars, E-62
  - central, E-57
  - development of, K-22
  - in disk storage, 360, **360**
  - instruction, A-15
  - limited, H-38 to H-40
  - load, 94, **94**, 95, 97, **101**, 102–103
  - reorder, 106–114, 107, 110, 111, 113, G-31 to G-32
  - streaming, K-54
  - victim, 301, 330, C-14
  - write, 210, 289, 291, 300–301, 301, 309
- bundles, G-34 to G-36, G-36, G-37
- Burks, A. W., 287, I-62, K-3
- buses
  - in barrier synchronization, H-15 to H-16, **H-16**
  - bottlenecks in, 216, 216
  - data misses and, H-26, H-26
  - development of, K-62 to K-63
  - fairness in, H-13
  - point-to-point links replacing, 390, **390**
  - in scoreboarding, A-70, A-73
  - in shared-media networks, E-22, E-22, E-40
  - single-bus systems, 217–218
  - in snooping coherence protocols, 211–212, **213, 214, 215**
  - in Tomasulo's approach, 93, 95, 96, 98, 101
  - in write invalidates, 209–210, 212, 213
- bypassing. See forwarding
- byte addressing, 9, 299
- byte order, B-7 to B-8, **B-8**

## C

- C description language extensions, **B-9**, B-36 to B-37
- C language, integer division and remainder in, I-12
- caches
  - 2:1 cache rule of thumb, C-28
  - AMD Opteron data cache example, C-12 to C-14, C-13, C-15
  - block addressing, 299, **299**, C-8 to C-9, C-8
  - block placement in, C-7 to C-8, C-7
  - block size, H-25 to H-26, H-25
  - data, C-9, C-13, C-15, F-46
  - defined, C-2
  - development of, K-53
  - in IBM Blue Gene/L, H-42
  - interprocessor communication and, H-5
  - L1 (See L1 cache)
  - L2 (See L2 cache)
  - multibanked, 298–299, 299, 309
  - multilevel, 291
  - nonblocking, 296–298, 297, 309, K-54
  - remote communication latency and, 205
  - in RISC pipelines, A-7
  - separated, C-14
  - SMT challenges to, 176–177
  - states of, 212
  - in superscalar processors, 155
  - tags in, 210–211, 289, C-36
  - trace, 131, **132, 133**, 296, **309**
  - victim, 301, K-54
  - virtual, C-36 to C-38, **C-37**
  - virtual memory compared with, C-40, **C-41**
  - writes in, C-9 to C-12, **C-10**, C-13
- cache access, pipelined, 296
- cache associativity. See associativity
- cache banks, 298–299, **299**
- cache blocks. See blocks
- cache coherence problem. See also cache coherence protocols
  - cache coherence protocols and, 207–208
  - I/O, 325–326
  - overview of, 205–207, **206**
  - snooping protocols, 208–209, **209**
  - snooping protocol example, 211–215, **213, 214, 215**
  - state diagrams and, 214, 215
  - write invalidate protocol implementation, 209–211
- cache coherence protocols. See also cache coherence problem; directory-based cache coherence protocols
  - avoiding deadlock from limiting buffering, H-38 to H-40
  - directory controller implementation, H-40 to H-41
  - in distributed shared-memory multiprocessors, H-36 to H-37
  - in distributed-memory multiprocessors, 232–233, 232, 233
  - in large-scale multiprocessors, H-34 to H-41
  - memory consistency in, 243–246
  - snooping protocols and, 208–218, H-34, H-35
  - spin lock scheme and, 241–242, 242
  - synchronization and, 240–242, 242
  - uniform memory access and, 217
- cache CPI equation, 168
- cache hierarchy. See also memory hierarchy
  - in AlphaServer 4100, 220
  - cache organization overview, 288–293, **292**
  - multilevel inclusion, 248–249
- cache hits, C-2
- cache indexing
  - address translation during, 291–292, C-36 to C-38, **C-37**, C-39
  - in AMD Opteron, 326, 329, C-12
  - equation for, 326, 329, **C-21**
  - index size and, C-38, C-46
- cache misses
  - block replacement in, C-10, C-14
  - categories of, C-22 to C-24, C-23, C-24
  - communication of, H-35 to H-36
  - defined, C-2
  - in in-order execution, C-2 to C-3
  - in invalidate protocols, 210–211
  - nonblocking caches and, 296–298, **297, 309**
  - processor performance and, C-17 to C-19
  - in SMP commercial workloads, 222–223, **222, 223**
- cache optimizations, C-22 to C-38
  - average memory access time formula, 290, C-15, C-21
  - avoiding address translation during cache indexing, C-36 to C-38, C-37, C-39
  - categories of, C-22
  - higher associativity and, C-28 to C-29, C-29, C-39
  - larger block sizes and, C-25 to C-28, C-26, C-27, C-39
  - larger cache sizes and, C-23, C-24, C-28, C-39
  - miss rate components and, C-22 to C-25, C-23, C-24
  - multilevel caches and, C-29 to C-34, **C-32, C-39**
  - read priorities over writes, C-34 to C-35, C-39
- cache performance, C-15 to C-21
  - average memory access time and, 290, 295, C-15 to C-17
  - cache size and, 293–295, **294, 309**
  - compiler optimizations and, 302–305, **304, 309**
  - compiler-controlled prefetching and, 305–309, **309**
  - critical word first and early restart, 299–300, **309**
  - hardware prefetching and, 305, 306, 309
  - high memory bandwidth and, 337–338, **339**
  - merging write buffers and, 300–301, **301, 309**
  - miss penalties and out-of-order processors, C-19 to C-21, C-21
  - multibanked caches and, 298–299, **299, 309**
  - nonblocking caches and, 296–298, **297, 309**
  - optimization summary, 309, 309
  - overemphasizing DRAM bandwidth, 336, **338**
  - overview of, C-3 to C-6
  - pipelined cache access and, 296, 309
  - predicting from other programs, 335, **335**
  - sufficient simulations for, 336, 337
  - trace caches and, 296, 309
  - way prediction and, 295, 309
- cache prefetching, 126–127, 305–306, 306, 309
- cache replacement miss, 214
- cache size
  - 2:1 cache rule of thumb, C-28
  - hit time and, 293–295, 294, 309
  - miss rates and, 252, 252, 291, C-25 to C-28, C-26, C-27, C-39
  - multiprogrammed workload misses and, 227–230, 228, 229
  - performance and, H-22, H-24, **H-24**, H-27, **H-28**
  - SMP workload misses and, 223–224, **223, 224, 226**
- cache-only memory architecture (COMA), K-41
- CACTI, 294, **294**
- call gates, C-52
- callee saving, B-19 to B-20, B-41 to B-43
- caller saving, B-19 to B-20
- canceling branches, A-24 to A-25
- canonical form, C-53
- capabilities, in protection, C-48 to C-49, K-52
- capacitive load, 18
- capacity misses
  - defined, 290, C-22
  - relative frequency of, C-22, C-23, C-24
  - in shared-memory multiprocessors, H-22 to H-26, **H-23 to H-26**
- carrier sensing, E-23
- carrier signals, D-21
- carry-lookahead adders (CLA), 38, I-37 to I-41, I-38, I-40 to **I-42, I-44**, I-63
- carry-propagate adders (CPAs), I-48
- carry-save adders (CSAs), I-47 to I-48, **I-48, I-50**, I-55
- carry-select adders, I-43 to I-44, I-43, I-44
- carry-skip adders, I-41 to I-43, I-42, I-44
- CAS (column access strobe), 311–313, 313
- case statements, register indirect jumps for, B-18
- CCD (charged-couple device), D-19
- CDB (common data bus), 93, 95, 96, 98, 101
- CDC 6600 processor
  - data trunks in, A-70
  - dynamic scheduling in, 95, A-67 to A-68, **A-68**, K-19
  - multithreading in, K-26
  - pipelining in, K-10
- CDC STAR-100, F-44, F-47
- CDC vector computers, F-4, F-34
- CDMA (code division multiple access), D-25
- Cell Broadband Engines (Cell BE), E-70 to E-72, **E-71**
- cell phones, D-20 to D-25, **D-21**, D-23, D-24
- cells, in octrees, H-9
- centralized shared-memory architectures, 199–200, **200**. See also symmetric shared-memory multiprocessors
- centralized switched networks, E-30 to E-34, E-31, E-33, E-48
- centrally buffered, E-57
- CFM (current frame pointer), G-33 to G-34
- Chai, L., E-77
- chaining, F-35, **F-35**
- channel adapters, E-7
- channels, D-24
- character operands, B-13
- character strings, B-14
- charged-couple devices (CCD), D-19
- checksum, E-8, E-12
- chimes, F-10 to F-12, F-20, F-40
- choke packets, E-65
- Cholesky factorization method, H-8
- CIFS (Common Internet File System), 391
- circuit switching, E-50, E-64
- circular queues, E-56
- CISC (complex instruction set computer), J-65
- CLA (carry-lookahead adders), 38, I-37 to I-41, I-38, I-40 to **I-42, I-44**, I-63
- clock cycles (clock rate)
  - associativity and, C-28 to C-29, C-29
  - CPI and, 140–141
  - memory stall cycles and, C-4 to C-5, C-20
  - processor speed and, 138–139, 139
  - SMT challenges and, 176, 179, 181, 183
- clock cycles per instruction (CPI)
  - in AMD Opteron, 331–335, 332, 333, 334
  - cache, 168
  - cache misses and, C-18
  - computation of, 41–44, 203–204
  - ideal pipeline, 66–67, **67**
  - in Pentium 4, 134, 136, 136
  - pipelining and, A-3, A-7, A-8, A-12
  - processor speed and, 138–139
  - in symmetric shared-memory multiprocessors, 221, 222
- clock rate. See clock cycles
- clock skew, A-10
- Clos topology, E-33, **E-33**
- clusters
  - commodity vs. custom, 198
  - development of, K-41 to K-44
  - in IBM Blue Gene/L, H-41 to H-44, **H-43, H-44**
  - Internet Archive, 392–397, **394**
  - in large-scale multiprocessors, H-44 to H-46, **H-45**
- Cm* multiprocessor, K-36
- C.mmp project, K-36
- CMOS chips, 18–19, 294, **294**, F-46
- coarse-grained multithreading, 173–174, **174**, K-26. See also multithreading
- Cocke, John, K-12, K-20, K-21 to K-22
- code division multiple access (CDMA), D-25
- code rearrangement, miss rate reduction from, 302
- code scheduling. See also dynamic scheduling
  - for control dependences, 73–74
  - global, 116, G-15 to G-23, **G-16**, G-20, G-22
  - local, 116
  - loop unrolling and, 79–80, 117–118
  - static scheduling, A-66
- code size, 80, 117, D-3, D-9
- CodePack, B-23
- coefficient of variance, 383
- coerced exceptions, A-40 to A-41, A-42
- coherence, 206–208. See also cache coherence problem; cache coherence protocols
- coherence misses
  - defined, 218, C-22
  - in multiprogramming example, 229
  - in symmetric shared-memory multiprocessors, H-21 to H-26, **H-23 to H-26**
  - true vs. false sharing, 218–219
- cold-start misses, C-22. See also compulsory misses
- collision detection, E-23
- collision misses, C-22
- collocation sites, E-85
- COLOSSUS, K-4
- column access strobe (CAS), 311–313, 313
- column major order, 303
- COMA (cache-only memory architecture), K-41
- combining trees, H-18
- commercial workloads
  - Decision Support System, 220
  - multiprogramming and OS performance, 225–230, **227**, 228, 229
  - online transaction processing, 220
  - SMP performance in, 220–224, 221 to 226
- committed instructions, A-45
- commodities, computers as, 21
- commodity clusters, 198, H-45 to H-46, **H-45**
- common case, focusing on, 38
- common data bus (CDB), 93, 95, 96, 98, **101**
- Common Internet File System (CIFS), 391
- communication
  - bandwidth, H-3
  - cache misses and, H-35 to H-36
  - global system for mobile communication, D-25
  - interprocessor, H-3 to H-6
  - latency, H-3 to H-4
  - message-passing vs. shared-memory, H-4 to H-6
  - multiprocessing models, 201–202
  - NEWS, E-41 to E-42
  - peer-to-peer, E-81 to E-82
  - remote access, 203–204
  - user-level, E-8
- compare, select, and store units (CSSU), D-8
- compare and branch instruction, B-19, B-19
- compare instructions, B-37
- compiler optimization, 302–305
  - branch straightening, 302
  - compared with other techniques, 309
  - compiler structure and, B-24 to B-26, **B-25**
  - examples of, B-27, **B-28**
  - graph coloring, B-26 to B-27
  - impact on performance, B-27, B-29
  - instruction set guidelines for, B-29 to B-30
  - loop interchange, 302–303
  - phase-ordering problem in, B-26
  - reducing code size and, B-43, B-44
  - technique classification, B-26, B-28
  - in vectorization, F-32 to F-34, F-33, F-34
- compilers
  - compiler-controlled prefetching, 305–309, **309**
  - development of, K-23 to K-24
  - eliminating dependent computations, G-10 to G-12
  - finding dependences, G-6 to G-10
  - global code scheduling, 116, G-15 to G-23, G-16, G-20, G-22
  - Java, K-10
  - multimedia instruction support, B-31 to B-32
  - performance of, B-27, B-29
  - recent structures of, B-24 to B-26, B-25
  - register allocation in, B-26 to B-27
  - scheduling, A-66
  - software pipelining in, G-12 to G-15, G-13, G-15
  - speculation, G-28 to G-32
- complex instruction set computer (CISC), J-65
- component failures, 367
- compulsory misses
  - defined, 290, C-22
  - in multiprogramming example, 228, **229**
  - relative frequency of, C-22, C-23, C-24
  - in SMT commercial workloads, 222, **224, 225**
- computation-to-communication ratios, H-10 to H-12, **H-11**
- computer architecture
  - defined, 8, 12, J-84, K-10
  - designing, 12–13, **13**
  - flawless design fallacy, J-81
  - functional requirements in, 13
  - historical perspectives on, J-83 to J-84, K-10 to K-11
  - instruction set architecture, 8–12, 9, 11, 12
  - organization and hardware, 12–15, 13
  - quantitative design principles, 37–44
  - signed numbers in, I-7 to I-10
  - trends in, 14–16, **15, 16**
- computer arithmetic, I-1 to I-65
  - carry-lookahead adders, I-37 to I-41, **I-38, I-40, I-41, I-42**, I-44
  - carry-propagate adders, I-48
  - carry-save adders, I-47 to I-48, I-48
  - chip design and, I-58 to I-61, I-58, I-59, I-60
  - denormalized numbers, I-15, I-20 to I-21, I-26 to I-27, I-36
  - exceptions, I-34 to I-35
  - faster division with one adder, I-54 to I-58, **I-55, I-56, I-57**
  - faster multiplication with many adders, I-50 to I-54, **I-50 to I-54**
  - faster multiplication with single adders, I-47 to I-50, I-48, I-49
  - floating-point addition, I-21 to I-27, **I-24**, I-36
  - floating-point arithmetic, I-13 to I-16, I-21 to I-27, **I-24**
  - floating-point multiplication, I-17 to I-21, **I-18, I-19, I-20**
  - floating-point number representation, I-15 to I-16, I-16
  - floating-point remainder, I-31 to I-32
  - fused multiply-add, I-32 to I-33
  - historical perspectives on, I-62 to I-65
  - instructions in RISC architectures, J-22, **J-22, J-23, J-24**
  - iterative division, I-27 to I-31, I-28
  - overflow, I-8, I-10 to I-12, I-11, I-20
  - in PA-RISC architecture, J-34 to J-35, J-36
  - pipelining in, I-15
  - precision in, **I-16**, I-21, I-33 to I-34
  - radix-2 multiplication and division, I-4 to I-7, **I-4, I-6**, I-55 to I-58, **I-56, I-57**
  - ripple-carry adders, I-2 to I-3, I-3, I-42, I-44
  - shifting over zeros technique, I-45 to I-47, **I-46**
  - signed numbers, I-7 to I-10, I-23, I-24, I-26
  - special values in, I-14 to I-15
  - underflow, I-36 to I-37, I-62
- computers, classes of, 4–8
- condition codes, A-5, A-46, **B-19**, J-9 to J-16, J-71
- condition registers, B-19
- conditional branch operations
  - in control flow, B-19, **B-19, B-20**
  - in RISC architecture, J-11 to J-12, J-17, J-34, J-34
- conditional instructions. See predicated instructions
- conditional moves, G-23 to G-24
- conflict misses
  - defined, 290, C-22
  - four divisions of, C-24 to C-25
  - relative frequency of, C-22, C-23, C-24
- congestion management, E-11, E-12, E-54, E-65
- connectedness, E-29
- Connection Multiprocessor 2, K-35
- connectivity, E-62 to E-63
- consistency. See cache coherence problem; cache coherence protocols; memory consistency models
- constant extension, in RISC architecture, J-6, J-9
- constellation, H-45
- contention
  - in centralized switched networks, E-32
  - congestion from, E-89
  - in network performance, E-25, E-53
  - network topologies and, E-38
  - in routing, E-45, E-47
  - in shared-memory multiprocessors, H-29
- contention delay, E-25, E-52
- context switch, 316, C-48
- control dependences, 72–74, 104–105, G-16
- control flow instructions, B-16 to B-2
  - addressing modes for, B-17 to B-18, **B-18**
  - conditional
See cache coherence | | computation-to-communication ratios, | I-32 | problem; cache coherence | | H-10 to H-12, <b>H-11</b> | fused multiply-add, I-32 to I-33 | protocols; memory | | computer architecture | historical perspectives on, I-62 to | consistency models | | defined, 8, 12, J-84, K-10 | I-65 | constant extension, in RISC | | designing, 12–13, <b>13</b> | instructions in RISC architectures, | architecture, J-6, J-9 | | flawless design fallacy, J-81 | J-22, <b>J-22</b> , <b>J-23</b> , <b>J-24</b> | constellation, H-45 | | functional requirements in, 13 | iterative division, I-27 to I-31, | contention | | historical perspectives on, J-83 to | I-28 | in centralized switched networks | | J-84, K-10 to K-11 | overflow, I-8, I-10 to I-12, I-11, | E-32 | | instruction set architecture, 8-12, | I-20 | congestion from, E-89 | | 9, 11, 12 | in PA-RISC architecture, J-34 to | in network performance, E-25, | | organization and hardware, | J-35, J-36 | E-53 | | 12–15, 13 | pipelining in, I-15 | network topologies and, E-38 | | quantitative design principles, | precision in, <b>I-16,</b> I-21, I-33 to | in routing, E-45, E-47 | | 37–44 | I-34 | in shared-memory | | signed numbers in, I-7 to I-10 | radix-2 multiplication and | multiprocessors, H-29 | | trends in, 14–16, <b>15, 16</b> | division, I-4 to I-7, <b>I-4, I-6,</b> | contention delay, E-25, E-52 | | computer arithmetic, I-1 to I-65 | I-55 to I-58, <b>I-56, I-57</b> | context switch, 316, C-48 | | carry-lookahead adders, I-37 to | ripple-carry adders, I-2 to I-3, I-3, | control dependences, 72–74, 104–105 | | I-41, <b>I-38, I-40, I-41, I-42</b> , | I-42, I-44 | G-16 | | I-44 | shifting over zeros technique, I-45 | control flow instructions, B-16 to B-2 | | carry-propagate adders, I-48 | to I-47, <b>I-46</b> | addressing modes for, B-17 to | | carry-save adders, I-47 to I-48, | signed numbers, I-7 to I-10, I-23, | B-18, <b>B-18</b> | | - | | | | I-48 | I-24, I-26<br>special values in, I-14 to I-15 | conditional 
branch operations,<br>B-19, <b>B-19, B-20</b> | | carry-select adders, I-43 to I-44, | | in Intel 80x86, J-51 | | I-43, I-44 | subtraction, I-22 to I-23, I-45 | | | carry-skip adders, I-41 to I-43, | systems issues, I-10 to I-13, <b>I-11</b> , | in MIPS architecture, B-37 to B-38, <b>B-38</b> | | I-42, I-44 | I-12 | | | | | | | procedure invocation options, | Cray C90, <b>F-7</b> , F-32, F-50 | cut-through switching, E-50, E-60, E-74 | |---|---|---| | B-19 to B-20 | Cray J90, F-50<br>Cray SV1, <b>F-7</b> | CYBER 180/990, A-55 | | types of, B-16 to B-17, <b>B-17</b> | | CYBER 205, F-44, F-48 | | control hazards, A-11, A-21 to A-26, | Cray T3D, E-86 to E-87, <b>E-87</b> , F-50, K-40 | cycle time, 310–311, <b>313</b> | | <b>A-21 to A-26,</b> F-3. See also | | Cydrome Cydra 5, K-22 to K-23 | | branch hazards; pipeline | Cray T3E, 260, K-40 | | | hazards | Cray T90, <b>F-7</b> , F-14, F-50 | | | control stalls, 74 | Cray T932, F-14 | D | | Convex C-1, <b>F-7</b> , <b>F-34</b> , F-49 | Cray X1 | Dally, Bill, E-1 | | Convex Exemplar, K-41 | characteristics of, F-7 | DAMQ (dynamically allocatable | | convoys, F-10 to F-12, F-13, F-18, | memory in, F-46 | multi-queues), E-56 to E-57 | | <b>F-35</b> , F-39 | multi-streaming processors in, | Darley, H.
M., I-58 | | Conway, L., 1-63 | F-43 | DARPA (Defense Advanced Research | | cooling, 19 | processor architecture in, F-40 to | Projects Agency), F-51 | | Coonen, J., I-34 | F-43, <b>F-41</b> , <b>F-42</b> , F-51 | data alignment, B-7 to B-8, <b>B-8</b> | | copy propagation, G-10 to G-11 | Cray X1E, E-20, E-44, E-56, F-44, | data caches, C-9, C-13, C-15, F-46 | | core plus ASIC (system on a chip), | F-51 | data dependences, 68–70, G-16 | | D-3, D-19, <b>D-20</b> | Cray X-MP | data flow | | correlating predictors, 83–86, <b>84, 85,</b> | characteristics of, F-7 | control dependences and, 73–74 | | 87, 88 | innovations in, F-48 | double data rate, 314–315, <b>314</b> | | Cosmic Cube, K-40 | memory pipelines on, F-38 | executions, 105 | | costs, 19–25 | multiple processors in, F-49 | hardware-based speculation and, | | in benchmarks, 375 | peak performance in, F-44 | 105 | | of branches, 80–89, <b>81, 84, 85, 87,</b> | vectorizing compilers in, F-34 | as ILP limitation, 170 | | 88 | Cray XT3, E-20, E-44, E-56 | value prediction and, 170 | | commodities and, 21 | Cray Y-MP, <b>F-7</b> , F-32 to F-33, <b>F-33</b> , | data hazards. 
See also RAW hazards; | | disk power and, 361 | F-49 to F-50 | WAR hazards; WAW hazards | | of integrated circuits, 21-25, 22, | Cray-1 | 2-cycle stalls, A-59, <b>A-59</b> | | 23 | chaining in, F-23 | minimizing stalls by forwarding, | | in interconnection networks, | characteristics of, F-7 | A-17 to A-18, <b>A-18</b> , A-35, | | <b>E-40,</b> E-89, E-92 | development of, K-12 | A-36, A-37 | | of Internet Archive clusters, | innovations in, F-48 | in MIPS pipelines, A-35 to A-37, | | 394–396 | memory bandwidth in, F-45 | A-38, A-39 | | learning curve and, 19 | peak performance on, F-44 | in pipelining, A-11, A-15 to A-21 | | linear speedups in multiprocessors | register file in, F-5 | A-16, A-18 to A-21 | | and, 259–260, <b>261</b> | Cray-2, <b>F-34,</b> F-46, F-48 | requiring stalls, A-19 to A-20, | | prices vs., 25–28 | Cray-3, F-50 | A-20, A-21 | | of RDRAM, 336, 338 | credit-based flow control, E-10, E-65, | in Tomasulo's approach, 96 | | of transaction-processing servers, | E-71, E-74 | in vector processors, F-2 to F-3, | | 49–50, <b>49</b> | critical path, G-16, G-19 | F-10 | | trends in, 19–25 | critical word first strategy, 299-300, | data miss rates | | of various computing classes, D-4 | 309 | on distributed-memory | | volume and, 20–21 | crossbars, 216, E-30, E-31, E-60 | multiprocessors, H-26 to | | yield and, 19–20, <b>20,</b> 22–24 | cryptanalysis machines, K-4 | H-32, H-28 to H-32 | | count registers, J-32 to J-33 | CSAs (carry-save adders), I-47 to I-48, | hardware-controlled prefetch and | | CPAs (carry-propagate adders), I-48 | I-48, I-50, I-55 | 307–309 | | CPI. 
See clock cycles per instruction | CSSU (compare, select, and store | in multiprogramming and OS | | CPU time, 28–29, 41–45, C-17 to | units), D-8 | workloads, 228, <b>228, 229</b> | | C-18, C-21 | current frame pointers (CFM), G-33 to | on symmetric shared-memory | | Cray, Seymour, F-1, F-48, F-50 | G-34 | multiprocessors, H-21 to | | Cray arithmetic algorithms, I-64 | custom clusters, 198, H-45 | H-26, <b>H-23 to H-26</b> | | 5.1 T | | , · · · · | | | | | | data parallelism, K-35 | in pipeline hazard prevention, | Dest field, 109 | |-----------------------------------------------|------------------------------------------|-----------------------------------------| | data paths | A-23 to A-25, <b>A-23</b> | destination blocks, H-8 | | for eight-stage pipelines, A-57 to | in restarting execution, A-43 | deterministic routing, E-46, E-53, | | A-59, <b>A-58, A-59</b> | in RISC architectures, J-22, <b>J-22</b> | <b>E-54,</b> E-93 | | in MIPS implementation, A-29 | Dell 2650, <b>322</b> | devices, E-2 | | in MIPS pipelines, A-30 to A-31, | Dell PowerEdge 1600SC, 323 | Dhrystone performance, 30, D-12 | | <b>A-31,</b> A-35, <b>A-37</b> | Dell PowerEdge 2800, <b>47, 48, 49</b> | die yield, 22–24 | | in RISC pipelines, A-7, A-8, A-9 | Dell PowerEdge 2850, <b>47, 48, 49</b> | dies, costs of, 21–25, <b>22, 23</b> | | data races, 245 | Dell Precision Workstation 380, 45 | Digital Alpha. See Alpha | | data rearrangement, miss rate | denormalized numbers, I-15, I-20 to | digital cameras, D-19, <b>D-20</b> | | reduction from, 302 | I-21, I-26 to I-27, I-36 | Digital Equipment Vax, 2 | | data transfer time, 311–313, <b>313</b> | density-optimized processors, E-85 | Digital Linear Tape, K-59 | | data trunks, A-70 | dependability. 
See reliability | digital signal processors (DSP), D-5 to | | datagrams, E-8, E-83 | dependence analysis, G-6 to G-10 | D-11 | | data-level parallelism, 68, 197, 199 | dependence distance, G-6 | in cell phones, D-23, <b>D-23</b> | | data-race-free programs, 245, K-44 | dependences, 68–75. See also pipeline | defined, D-3 | | DDR (double data rate), 314–315, <b>314</b> | hazards | media extensions, D-10 to D-11, | | dead time, F-31 to F-32, <b>F-31</b> | control, 72-74, 104-105, G-16 | D-11 | | dead values, 74 | data, 68-70, G-16 | multiply-accumulate in, J-19, | | deadlock avoidance, E-45 | eliminating dependent | J-20 | | deadlock recovery, E-46 | computations, G-10 to G-12 | overview, D-5 to D-7, <b>D-6</b> | | deadlocked protocols, 214 | finding, G-6 to G-10 | saturating arithmetic in, D-11 | | deadlocks | greatest common divisor test, G-7 | TI 320C6x, D-8 to D-10, <b>D-9</b> , | | adaptive routing and, E-93 | interprocedural analysis, G-10 | D-10 | | bubble flow control and, E-53 | loop unrolling and, G-8 to G-9 | TI TMS320C55, D-6 to D-8, <b>D-6</b> , | | characteristics of, H-38 | loop-carried, G-3 to G-5 | D-7 | | in dynamic network | name, 70–71 | dimension-order routing, E-46, E-53 | | reconfiguration, E-67 | number of registers to analyze, | DIMMs (dual inline memory | | from limited buffering, H-38 to | 157 | modules), 312, 314, <b>314</b> | | H-40 | recurrences, G-5, G-11 to G-12 | direct addressing mode, B-9 | | in network routing, E-45, E-47, | types of, G-7 to G-8 | direct attached disks, 391 | | E-48 | unnecessary, as ILP limitations, | direct networks, E-34, E-37, E-48, | | DeCegama, Angel, K-37 | 169–170 | E-67, E-92 | | decimal operations, J-35 | depth of pipeline, A-12 | Direct RDRAM, 336, 338 | | decision support systems (DSS), | descriptor privilege level (DPL), C-51 | direct-mapped caches | | 220–221, <b>221, 222</b> | descriptor tables, C-50 to C-51 | block addresses in, C-8, C-8 | | decoding | design faults, 367, 370 | block replacement 
with cache | | forward error correction codes, | desktop computers | misses, C-9, C-10 | | D-6 | benchmarks for, 30-32 | defined, 289, C-7 to C-8, C-7 | | in RISC instruction set | characteristics of, <b>D-4</b> | development of, K-53 | | implementation, A-5 to A-6 | disk storage on, K-61 to K-62 | size of, 291, <b>292</b> | | in unpipelined MIPS | instruction set principles in, B-2 | directory controllers, H-40 to H-41 | | implementation, A-27 | memory hierarchy in, 341 | directory-based cache coherence | | dedicated link networks, E-5, E-6, <b>E-6</b> | multimedia support for, D-11 | protocols | | Defense Advanced Research Projects | operand type and size in, B-13 to | defined, 208, 231 | | Agency (DARPA), F-51 | B-14 | development of, K-40 to K-41 | | delayed branch schemes | performance and | distributed shared-memory and, | | development of, K-24 | price-performance of, 44–46, | 230–237, <b>232, 233, 235, 236</b> | | in MIPS R4000 pipeline, A-60, | 45, 46 | example, 234–237, <b>235, 236</b> | | A-60 | rise of, 4 | overview of, 231–234, <b>232, 233</b> | | | system characteristics, 5, 5 | | | directory-based multiprocessors, | division | Duato's Protocol, E-47 | |------------------------------------------------------|----------------------------------------|------------------------------------------------| | H-29, <b>H-31</b> | faster, with one adder, I-54 to | dynamic branch frequency, 67 | | dirty bits, C-10, C-44 | I-58, <b>I-55, I-56, I-57</b> | dynamic branch prediction, 82-86, | | Discrete Cosine Transform, D-5 | floating-point remainder, I-31 to | D-4. See also hardware | | Discrete Fourier Transform, D-5 | I-32 | branch prediction | | disk arrays, 362–366, <b>363</b> , <b>365</b> , K-61 | fused multiply-add, I-32 to I-33 | dynamic memory disambiguation. See | | to K-62. 
See also RAID | iterative, I-27 to I-31, I-28 | memory alias analysis | | disk storage | radix-2 integer, I-4 to I-7, I-4, I-6, | dynamic network reconfiguration, | | areal density in, 358 | I-55 to I-56, <b>I-55</b> | E-67 | | buffers in, 360, <b>360</b> | shifting over zeros technique, I-45 | dynamic power, 18-19 | | development of, K-59 to K-61 | to I-47, <b>I-46</b> | dynamic RAM. See DRAM | | disk arrays, 362–366, <b>363, 365</b> | speed of, I-30 to I-31 | dynamic scheduling, 89–104. See also | | DRAM compared with, 359 | SRT, I-45 to I-47, I-46, I-55 to | Tomasulo's approach | | failure rate of, 50-51 | I-58, <b>I-57</b> | advantages of, 89 | | intelligent interfaces in, 360, 360, | do loops, dependences in, 169 | defined, 89 | | 361 | Dongarra, J. J., F-48 | development of, K-19, K-22 | | power in, 361 | double data rate (DDR), 314-315, 314 | evaluation pitfalls, A-76 | | RAID, K-61 to K-62 (See also | double extended precision, I-16, I-33 | examples of, 97-99, 99, 100 | | RAID) | double precision, A-64, I-16, I-33, | loop-based example, 102-104 | | Tandem disks, 368-369, 370 | J-46 | multiple issue and speculation in, | | technology growth in, 14, | double words, J-50 | 118–121, <b>120, 121</b> | | 358–359, <b>358</b> | double-precision floating-point | overview, 90–92 | | Tertiary Disk project, 368, 369, | operands, A-64 | scoreboarding technique, A-66 to | | 399, <b>399</b> | downtime, cost of, 6 | A-75, <b>A-68</b> , <b>A-71</b> to <b>A-75</b> | | dispatch stage, 95 | DPL (descriptor privilege level), C-51 | Tomasulo's algorithm and, 92-97 | | displacement addressing mode | DRAM (dynamic RAM) | 100–104, <b>101, 103</b> | | in Intel 80x86, J-47 | costs of, 19-20, <b>359</b> | dynamically allocatable multi-queues | | overview, <b>B-9</b> , B-10 to B-11, | DRDRAM, 336, <b>338</b> | (DAMQs), E-56 to E-57 | | B-11, B-12 | embedded, D-16 to D-17, <b>D-16</b> | dynamically shared libraries, B-18 | | display lists, D-17 to D-18 | historical performance of, 312, 
| | | distributed routing, E-48 | 313 | Ε | | distributed shared-memory (DSM) | memory performance | early restart strategy, 299–300 | | multiprocessors. See also | improvement in, 312–315, | Earth Simulator, F-3 to F-4, F-51 | | multiprocessing | 314 | Ecache, F-41 to F-43 | | cache coherence in, H-36 to H-37 | optimization of, 310 | Eckert, J. Presper, K-2 to K-3, K-5 | | defined, 202 | organization of, 311–312, <b>311</b> | e-cube routing, E-46 | | development of, K-40 | overestimating bandwidth in, 336, | EDN Embedded Microprocessor | | directory-based coherence and, | 338 | Benchmark Consortium | | 230–237, <b>232, 233, 235, 236</b> | redundant memory cells in, 24 | (EEMBC), 30, D-12 to D-13 | | in large-scale multiprocessors, | refresh time in, 312 | D-12, D-13, D-14 | | H-45 | synchronous, 313-314 | EDSAC, K-3 | | latency of memory references in, | technology growth, 14 | EDVAC, K-2 to K-3 | | H-32 | in vector processors, F-46, F-48 | EEMBC benchmarks, 30, D-12 to | | distributed switched networks, E-34 to | DRDRAM (direct RDRAM), 336, 338 | D-13, <b>D-12, D-13, D-14</b> | | E-39, <b>E-36, E-37, E-40,</b> E-46 | driver domains, 321–322, <b>323</b> | effective address, A-4, B-9 | | distributed-memory multiprocessors | DSM. See distributed shared-memory | effective bandwidth | | advantages and disadvantages of, | (DSM) multiprocessors | defined, E-13 | | 201 | DSP. 
See digital signal processors | in Element Interconnect Bus, | | architecture of, 200-201, 201 | DSS (decision support system), | E-72 | | scientific applications on, H-26 to | 220–221, <b>221, 222</b> | latency and, E-25 to E-29, E-27, | | H-32, <b>H-28 to H-32</b> | dual inline memory modules | E-28 | | | (DIMMs), 312, 314, <b>314</b> | | | | | | ## I-12 \* Index | effective bandwidth (continued) | in TI 320C6x, D-8 to D-10, <b>D-9</b> , | Ethernet switches, 368, 369 | |------------------------------------------|---------------------------------------------------|-------------------------------------------| | network performance and, E-16 to | D-10 | even/odd multipliers, I-52, I-52 | | E-19, <b>E-19</b> , E-25 to E-29, | in TI TMS320C55, D-6 to D-8, | EVEN-ODD scheme, 366 | | <b>E-28,</b> E-89, <b>E-90</b> | D-6, D-7 | EX. See execution/effective address | | network switching and, E-50, | vector instructions in, F-47 | cycle | | E-52 | Emer, Joel, K-7 | exceptions | | network topology and, E-41 | Emotion Engine, SP2, D-15 to D-18, | coerced, A-40 to A-41, A-42 | | packet size and, E-18, E-19 | D-16, D-18 | in computer arithmetic, I-34 to | | effective errors, 367 | encoding, B-21 to B-24 | I-35 | | efficiency, EEMBC benchmarks for, | fixed-length, 10, B-22, <b>B-22</b> | dynamic scheduling and, 91, 95 | | D-13, <b>D-13, D-14</b> | hybrid, <b>B-22</b> , <b>B</b> -23 | floating-point, A-43 | | efficiency factor, E-52, E-55 | in packet transport, E-9 | inexact, I-35 | | EIB (Element Interconnect Bus), E-3, | reduced code size in RISCs, B-23 | instruction set complications, | | E-70, <b>E-71</b> | to B-24 | A-45 to A-47 | | eigenvalue method, H-8 | variable-length, 10, B-22 to B-23, | invalid, I-35 | | eight-way conflict misses, C-24 | B-22 | in MIPS pipelining, A-38 to A-41, | | 80x86 processors. 
See Intel 80x86 | in VAX, J-68 to J-70, <b>J-69</b> | <b>A-40</b> , <b>A-42</b> , A-43 to A-45, | | ElanSC520, D-13, <b>D-13</b> | end-to-end flow control, E-65, E-94 to | A-44 | | elapsed time, 28. See also latency | E-95 | order of instruction, A-38 to A-41, | | Element Interconnect Bus (EIB), E-3, | energy efficiency, 182 | A-40, A-42 | | E-70, E-71 | EnergyBench, D-13, <b>D-13</b> | precise exceptions, A-43, A-54 to | | embedded systems, D-1 to D-26 | Engineering Research Associates | A-56 | | benchmarks in, D-12 to D-13, | (ERA), K-4 | preserving, in compiler | | D-12, D-13, D-14 | ENIAC (Electronic Numerical | speculation, G-27 to G-31 | | cell phones, D-20 to D-25, <b>D-21</b> , | Integrator and Calculator), | program order and, 73–74 | | D-23, D-24 | K-2, K-59 | restarting execution, A-41 to A-43 | | characteristics of, D-4 | environmental faults, 367, 369, 370 | underflow, I-36 to I-37, I-62 | | costs of, 5 | EPIC (Explicitly Parallel Instruction | exclusion policy, in AMD Opteron, | | data addressing modes in, J-5 to | Computer), 114, <b>115</b> , 118, | 329, 330 | | J-6, <b>J-6</b> | G-33, K-24 | exclusive cache blocks, 210–211 | | defined, 5 | ERA (Engineering Research | execution time, 28, 257–258, C-3 to | | digital signal processors in, J-19 | Associates), K-4 | C-4. 
See also response time | | instruction set principles in, 4, B-2 | error latency, 366–367 | execution trace cache, 131, 132, 133 | | media extensions in, D-10 to | errors | execution/effective address cycle (EX) | | D-11, <b>D-11</b> | bit error rate, D-21 to D-22 | in floating-point MIPS pipelining, | | MIPS extensions in, J-19 to J-24, | effective, 367 | A-47 to A-49, <b>A-48</b> | | J-23, J-24 | forward error correction codes, | in RISC instruction set, A-6 | | multiprocessors, D-3, D-14 to | D-6 | in unpipelined MIPS | | D-15 | latent, 366–367 | implementation, A-27 to | | overview, 7–8 | meaning of, 366–367 | A-28, <b>A-29</b> | | power consumption and efficiency | round-off, D-6, <b>D-6</b> | expand-down field, C-51 | | in, D-13, <b>D-13</b> | escape path, E-46 | explicit parallelism, G-34 to G-37, | | real-time constraints in, D-2 | escape resource set, E-47 | G-35, G-36, G-37 | | real-time processing in, D-3 to | eServer p5 595, <b>47</b> , <b>48</b> , <b>49</b> | Explicitly Parallel Instruction | | D-5 | Eshraghian, K., I-65 | Computer (EPIC), 114, 115, | | reduced code size in RISCs, B-23 | ETA-10, <b>F-34</b> , F-49 | 118, G-33, K-24 | | to B-24 | Ethernet | exponential back-off, H-17 to H-18, | | in Sanyo VPC-SX500 digital | as local area network, E-4 | H-17 | | camera, D-19, <b>D-20</b> | overview of, E-77 to E-79, E-78 | exponential distributions, 383–384, | | in Sony Playstation 2, D-15 to | packet format in, E-75 | 386. 
See also Poisson | | D-18, <b>D-16, D-18</b> | performance, E-89, <b>E-90</b> | distribution | | | as shared-media network, E-23 | exponents, I-15 to I-16, <b>I-16</b> | | B-3, J-45 | C-10 | representation of floating-point | |--------------------------------------------|----------------------------------------------|------------------------------------------| | | | | | extended precision, I-33 to I-34 | file server benchmarks, 32 | numbers, I-15 to I-16, I-16 | | extended stack architecture, J-45 | filers, 391, 397–398 | in SPARC architecture, J-31 to | | | fine-grained multithreading, 173-175, | J-32 | | F | 174. See also multithreading | special values in, I-14 to I-15 | | failure, defined, 366-367 | finite-state controllers, 211 | subtraction, I-22 to I-23 | | failure rates, 26–28, 41, 50–51 | first in, first out (FIFO), 382, C-9, | underflow, I-36 to I-37, I-62 | | failures in time (FIT), 26–27 | C-10 | floating-point operations. See also | | fairness, E-23, E-49, H-13 | first-reference misses, C-22 | floating-point arithmetic | | false sharing misses | Fisher, J., 153, K-21 | blocked floating point, D-6 | | in SMT commercial workloads, | FIT (failures in time), 26–27 | conditional branch options, <b>B-20</b> | | 222, <b>224, 225</b> | five nines (99.999%) claim, 399 | instruction operators in, B-15 | | | fixed point computations, I-13 | latencies of, 75, <b>75</b> | | in symmetric shared-memory | fixed-field decoding, A-6 | maintaining precise exceptions, | | multiprocessors, 218–219, | fixed-length encoding, 10, B-22, <b>B-22</b> | A-54 to A-56 | | 224 | fixed-point arithmetic, D-5 to D-6 | in media extensions, D-10 | | fast page mode, 313 | flash memory, 359–360 | memory addressing in, <b>B-12</b> , | | fat trees, <b>E-33</b> , E-34, E-36, E-38, | flexible chaining, F-24 to F-25 | B-13 | | <b>E-40,</b> E-48 | flit, E-51, E-58, E-61 | in MIPS architecture, B-38 to | | fault detection, 51–52 | | • | | fault tolerance, IEEE on, 366 | Floating Point Systems 
AP-120B, | B-39, <b>B-40</b> | | faulting prefetches, 306 | K-21 | MIPS pipelining in, A-47 to A-56, | | faults. See also exceptions | floating-point arithmetic. See also | A-48 to A-51, A-57, A-58 | | address, C-40 | floating-point operations | MIPS R4000 pipeline example, | | categories of, 367, 370 | addition in, I-21 to I-27, I-24, I-36 | A-60 to A-65, <b>A-61</b> to <b>A-65</b> | | design, 367, <b>370</b> | in Alpha, J-29 | multicore processor comparisons, | | environmental, 367, 369, <b>370</b> | chip design and, I-58 to I-61, <b>I-58</b> , | 255 | | hardware, 367, <b>370</b> | 1-59, 1-60 | nonblocking caches and, 297–298 | | intermittent, 367 | conversions to integer arithmetic, | operand types and sizes, B-13 to | | meaning of, 366–367 | I-62 | B-14, <b>B-15</b> | | page, C-3, C-40 | denormalized numbers, I-15, I-20 | paired single operations and, D-10 | | permanent, 367 | to I-21, I-26 to I-27, I-36 | to D-11 | | transient, 367, 378-379 | development of, K-4 to K-5 | parallelism and, 161–162, <b>162</b> , | | fault-tolerant routing, E-66 to E-68, | exceptions in, A-43, I-34 to I-35 | <b>166,</b> 167 | | <b>E-69,</b> E-74, E-94 | fused multiply-add, I-32 to I-33 | performance growth since | | FCC (Federal Communications | historical perspectives on, I-62 to | mid-1980s, <b>3</b> | | Commission), 371 | I-65 | scoreboarding, A-66 to A-75, | | feature size, 17 | in IBM 360, J-85 to J-86, <b>J-85</b> , | A-68, A-71 to A-75 | | Federal Communications Commission | J-86, J-87 | in Tomasulo's approach, 94, 94, | | (FCC), 371 | IEEE standard for, I-13 to I-14, | 107 | | Feng, Tse-Yun, E-1 | I-16 | in vector processors, F-4, F-6, | | fetch-and-increment synchronization | instructions in RISC architectures, | <b>F-8</b> , F-11 | | primitive, 239-240, H-20 to | J-23 | floating-point registers (FPRs), B-34, | | H-21, <b>H-21</b> | in Intel 80x86, J-52 to J-55, <b>J-54</b> , | B-36 | | FFT kernels | J-61 | floating-point status register, B-34 | | characteristics of, H-7, 
H-11 | iterative division, I-27 to I-31, | floppy disks, K-60 | | on distributed-memory | I-28 | flow control | | multiprocessors, H-27 to | in MIPS 64, J-27 | bubble, E-53, E-73 | | H-29, <b>H-28 to H-32</b> | multiplication, I-17 to I-21, I-18, | in buffer overflow prevention, | | on symmetric shared-memory | I-19, I-20 | E-22 | | multiprocessors, H-21 to | pipelining in, I-15 | in congestion management, E-65 | | H-26, <b>H-23 to H-26</b> | precision in, I-21, I-33 to I-34 | | | | | | | flow control ( <i>continued</i> ) credit-based, E-10, E-65, E-71, | full adders, I-2 to I-3, I-3<br>full bisection bandwidth, E-39, E-41 | global optimizations, B-26, <b>B-28</b> global scheduling, 116 | |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | E-74 | full-duplex mode, E-22 | global system for mobile | | defined, E-10 | fully associative caches, 289, C-7, | communication (GSM), D-25 | | in distributed switched networks, | C-7, C-25 | global/stack analysis, 164–165, <b>164</b> | | E-38 | fully connected, E-34, E-40 | Goldberg, D., I-34 | | end-to-end, E-65 | function pointers, register indirect | Goldberg, I. B., I-64 | | link-level, E-58, E-62, E-65, E-72, | jumps for, B-18 | Goldberg, Robert, 315 | | E-74 | fused multiply-add, I-32 to I-33 | Goldschmidt's algorithm, I-29, I-30, | | in lossless networks, E-11 | future file, A-55 | I-61 | | network performance and, E-17 | | Goldstine, H. 
H., 287, I-62, K-2 to K-3 | | Stop & Go, E-10 | G | Google, E-85 | | switching and, E-51 | galaxy evolution, H-8 to H-9 | GPR (general-purpose registers), | | Xon/Xoff, E-10 | gallium arsenide, F-46, F-50 | B-34, G-38 | | flow-balanced state, 379 | gateways, E-79 | GPR computers, B-3 to B-6, <b>B-4</b> , <b>B-6</b> | | flush pipeline scheme, A-22, A-25 | gather operations, F-27 | gradual underflow, I-15, I-36 | | FM (frequency modulations), D-21 | gather/scatter addressing, B-31 | grain size, defined, 199 | | form factor, E-9 | GCD (greatest common divisor) test, | graph coloring, B-26 to B-27 | | FORTRAN | G-7 | greatest common divisor (GCD) test, | | integer division and remainder in, | general-purpose register (GPR) | G-7 | | I-12 | computers, B-3 to B-6, <b>B-4</b> , | grid, E-36 | | vector processors in, F-17, F-21,<br>F-33, <b>F-34,</b> F-44 to F-45, | B-6 | GSM (global system for mobile communication), D-25 | | F-45 | general-purpose registers (GPRs), | guest domains, 321–322, <b>323</b> | | forward error correction codes, D-6 | B-34, G-38 | guests, in virtual machines, 319–320, | | forward path, in cell phone base | GENI (Global Environment for | 321 | | stations, D-24 | Network Innovation), E-98 | 321 | | | | H | | forwarding chaining, F-23 to F-25, <b>F-24</b> in longer latency pipelines, A-49 to A-54, <b>A-50</b> , <b>A-51</b> minimizing data hazard stalls by, A-17 to A-18, <b>A-18</b> in MIPS pipelines, A-35, <b>A-36</b> , A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP.
See floating-point arithmetic; | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, | | forwarding chaining, F-23 to F-25, <b>F-24</b> in longer latency pipelines, A-49 to A-54, <b>A-50</b> , <b>A-51</b> minimizing data hazard stalls by, A-17 to A-18, <b>A-18</b> in MIPS pipelines, A-35, <b>A-36</b> , A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. 
See floating-point arithmetic; floating-point operations | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, 85, 87, 88 | | forwarding chaining, F-23 to F-25, <b>F-24</b> in longer latency pipelines, A-49 to A-54, <b>A-50</b> , <b>A-51</b> minimizing data hazard stalls by, A-17 to A-18, <b>A-18</b> in MIPS pipelines, A-35, <b>A-36</b> , A-37, A-59, <b>A-59</b> forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. 
See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, 85, 87, 88 development of, K-20 | | forwarding chaining, F-23 to F-25, <b>F-24</b> in longer latency pipelines, A-49 to A-54, <b>A-50</b> , <b>A-51</b> minimizing data hazard stalls by, A-17 to A-18, <b>A-18</b> in MIPS pipelines, A-35, <b>A-36</b> , A-37, A-59, <b>A-59</b> forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. 
See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, B-36 | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 in VLIW, 116 | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, 85, 87, 88 development of, K-20 effects of branch prediction | | forwarding chaining, F-23 to F-25, F-24 in longer latency pipelines, A-49 to A-54, A-50, A-51 minimizing data hazard stalls by, A-17 to A-18, A-18 in MIPS pipelines, A-35, A-36, A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. 
See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, B-36 fragment field, E-84 | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 in VLIW, 116 global collective networks, H-42, | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, 85, 87, 88 development of, K-20 effects of branch prediction schemes, 160–162, 160, 162 | | forwarding chaining, F-23 to F-25, F-24 in longer latency pipelines, A-49 to A-54, A-50, A-51 minimizing data hazard stalls by, A-17 to A-18, A-18 in MIPS pipelines, A-35, A-36, A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, B-36 fragment field, E-84 Frank, S. 
J., K-39 | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 in VLIW, 116 global collective networks, H-42, H-43 | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, 85, 87, 88 development of, K-20 effects of branch prediction schemes, 160–162, 160, 162 in ideal processor, 155, 160–162, | | forwarding chaining, F-23 to F-25, F-24 in longer latency pipelines, A-49 to A-54, A-50, A-51 minimizing data hazard stalls by, A-17 to A-18, A-18 in MIPS pipelines, A-35, A-36, A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, B-36 fragment field, E-84 Frank, S. 
J., K-39 freeze pipeline scheme, A-22 | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 in VLIW, 116 global collective networks, H-42, H-43 global common subexpression | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, 85, 87, 88 development of, K-20 effects of branch prediction schemes, 160–162, 160, 162 in ideal processor, 155, 160–162, 160, 162 | | forwarding chaining, F-23 to F-25, F-24 in longer latency pipelines, A-49 to A-54, A-50, A-51 minimizing data hazard stalls by, A-17 to A-18, A-18 in MIPS pipelines, A-35, A-36, A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, B-36 fragment field, E-84 Frank, S. J., K-39 freeze pipeline scheme, A-22 Freiman, C. 
V., I-63 | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 in VLIW, 116 global collective networks, H-42, H-43 global common subexpression elimination, B-26, B-28 | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, 85, 87, 88 development of, K-20 effects of branch prediction schemes, 160–162, 160, 162 in ideal processor, 155, 160–162, 160, 162 integrated instruction fetch units | | forwarding chaining, F-23 to F-25, F-24 in longer latency pipelines, A-49 to A-54, A-50, A-51 minimizing data hazard stalls by, A-17 to A-18, A-18 in MIPS pipelines, A-35, A-36, A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, B-36 fragment field, E-84 Frank, S. J., K-39 freeze pipeline scheme, A-22 Freiman, C. 
V., I-63 frequency modulations (FM), D-21 | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 in VLIW, 116 global collective networks, H-42, H-43 global common subexpression elimination, B-26, B-28 global data area, in compilers, B-27 | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80-89 branch-prediction buffers and, 82-86, 83, 84, 85 branch-target buffers, 122-125, 122, 124 correlating predictors, 83-86, 84, 85, 87, 88 development of, K-20 effects of branch prediction schemes, 160-162, 160, 162 in ideal processor, 155, 160-162, 160, 162 integrated instruction fetch units and, 126-127 | | forwarding chaining, F-23 to F-25, F-24 in longer latency pipelines, A-49 to A-54, A-50, A-51 minimizing data hazard stalls by, A-17 to A-18, A-18 in MIPS pipelines, A-35, A-36, A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, B-36 fragment field, E-84 Frank, S. J., K-39 freeze pipeline scheme, A-22 Freiman, C. 
V., I-63 | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 in VLIW, 116 global collective networks, H-42, H-43 global common subexpression elimination, B-26, B-28 global data area, in compilers, B-27 Global Environment for Network | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80–89 branch-prediction buffers and, 82–86, 83, 84, 85 branch-target buffers, 122–125, 122, 124 correlating predictors, 83–86, 84, 85, 87, 88 development of, K-20 effects of branch prediction schemes, 160–162, 160, 162 in ideal processor, 155, 160–162, 160, 162 integrated instruction fetch units and, 126–127 in Pentium 4, 132–134, 134 | | forwarding chaining, F-23 to F-25, F-24 in longer latency pipelines, A-49 to A-54, A-50, A-51 minimizing data hazard stalls by, A-17 to A-18, A-18 in MIPS pipelines, A-35, A-36, A-37, A-59, A-59 forwarding logic, 89 forwarding tables, E-48, E-57, E-60, E-67, E-74 Fourier-Motzkin test, K-23 four-way conflict misses, C-24 FP. See floating-point arithmetic; floating-point operations FPRs (floating-point registers), B-34, B-36 fragment field, E-84 Frank, S. J., K-39 freeze pipeline scheme, A-22 Freiman, C. 
V., I-63 frequency modulations (FM), D-21 Fujitsu VP100/VP200, F-7, F-49, F-50 | geometric standard deviation, 36-37 Gibson instruction mix, K-6 Gilder, George, 357 global address space, C-50 global code motion, G-16 to G-19, G-17 global code scheduling, G-15 to G-23 control and data dependences in, G-16 global code motion, G-16 to G-19, G-17 overview of, G-16, G-16 predication with, G-24 superblocks, G-21 to G-23, G-22 trace scheduling, G-19 to G-21, G-20 in VLIW, 116 global collective networks, H-42, H-43 global common subexpression elimination, B-26, B-28 global data area, in compilers, B-27 | hackers, J-65 half adders, I-2 to I-3 half-duplex mode, E-22 half-words, B-13, B-34 handshaking, E-10 hard real-time systems, D-3 to D-4 hardware, defined, 12 hardware branch prediction, 80-89 branch-prediction buffers and, 82-86, 83, 84, 85 branch-target buffers, 122-125, 122, 124 correlating predictors, 83-86, 84, 85, 87, 88 development of, K-20 effects of branch prediction schemes, 160-162, 160, 162 in ideal processor, 155, 160-162, 160, 162 integrated instruction fetch units and, 126-127 | | 160, 161, 162, K-20 trace caches and, 296 hardware pare scription notation, J-25 hardware description notation, J-25 hardware faults, 367, 370 hardware prefetching. See prefetching hardware-based speculation, 104–114. See also speculation data flow value prediction, 170 limitations of, 170–171, 182–183 reorder buffer in, 106–114, 107, 110, 111, 113 Tomasulo's approach and, 105–109 Harvard architecture, K-4 hazards. 
See pipeline hazards; RAW hazards; WAR hazards; WAW hazards head-of-line (HOL) blocking adaptive routing and, E-54 buffer organizations and, E-58, E-59 buffered crossbar switches and, E-62 congestion management and, E-64 virtual channels and, E-93 heaps, in compilers, B-27 to B-28 heat dissipation, 19 helical scan, K-59 Hennessy, J., K-12 to K-13 HEP processor, K-26 Hewlett-Packard PA-RISC. See PA-RISC High Productivity Computing Systems (HPCS), F-51 higher-radix division, I-55 to I-58, I-55, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 Hillis, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 on computer arithmetic, I-62 to I-65 cryptanalysis machines, K-4 on DRAM, 312, <b>313</b> on floating point, K-4 to K-5 on Intel IA-64 and Itanium 2, G-44 on interconnection networks, E-97 to E-104 on magnetic storage, K-59 to … on memory hierarchy and protection, K-52 to K-54 on multiprocessors and parallel processing, K-34 to K-45 on pipelining and ILP, K-18 to K-27 on quantitative performance measures, K-6 to K-7 on vector processors, F-47 to F-51 history file, A-55 hit time associativity and, C-28 to C-29 average memory access time and, C-15 during cache indexing, 291–292 | HLLCA (high-level language computer architecture), B-26, B-28, B-39 to B-43, B-45, K-11 Hoagland, Al, 357 HOL. See head-of-line (HOL) blocking home nodes, 232, 233 Honeywell Bull, K-61 Hopkins, M., G-1 hosts, in virtual machines, 318 hot-swapping, E-67 hubs, E-79 Hwu, Wen-Mei, K-24 hybrid encoding, B-22, B-23 hypercubes, E-36, E-37, E-40, E-92, K-41 HyperTransport, E-63 I … architecture in, J-2, J-42, K-10 development of, J-83 to J-89 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 |
Higher-radix multiplication, I-55 to 1-58, I-55, I-56, I-57 trace caches and, 296, 309 hit under miss optimization, 296–298, I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M, D., 247, K-54 Hill, M, D., 247, K-54 Hillis, Danny, K-38 historical perspectives, K-1 to K-67 on vultiprocessors and parallel processing, K-34 to K-45 on pipelining and ILP, K-18 K-47 on quantitative performance measures, K-6 to K-7 on vector processors, K-4 to K-47 IA-64. See Intel IA-64 IAS computer, K-3 IBM 360, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, | | | • | | buffered crossbar switches and, E-62 on pipelining and II.P, K-18 to congestion management and, E-64 on quantitative performance virtual channels and, E-93 measures, K-6 to K-7 on vector processors, F-47 to F-51 history file, A-55 hit time history file, A-55 hit time A-8. R-18 computer, K-26 during cache indexing, 291–292. 
High-radix multiplication, I-98 hit under miss optimization, 296–298, I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M, D., 247, K-54 Hill, M, D., 247, K-54 Hillis, Danny, K-38 historical perspectives, K-1 to K-67 history and II.P, K-18 to K-45 on pipelining and II.P, K-18 to K-27 on pipelining and II.P, K-18 to K-27 lland II.P, K-18 to K-27 on pipelining and II.P, K-18 to K-27 lland K-45 on pipelining and II.P, K-18 to K-27 lland II.P, K-18 to K-45 on pipelining and II.P, K-18 to K-27 lland II.P, K-18 to K-45 on pipelining and II.P, K-18 to K-45 on pipelining and II.P, K-18 to K-27 lland II.P, K-26 to K-29 measures, K-6 to K-7 on quantitative performance measures, K-6 to K-7 on vector processors, K-46 to K-47 lla-64. See Intel IA-64. | = | | | | E-62 congestion management and, E-63 congestion management and, E-64 on pipelining and ILP, K-18 to K-27 on quantitative performance virtual channels and, E-93 on vector processors, K-6 to K-7 on vector processors, F-47 to F-51 history file, A-55 hit time A-56 hit time history file, A-55 hit time history file, A-55 hit tim | | | | | congestion management and, E-64 on quantitative performance virtual channels and, E-93 measures, K-6 to K-7 on vector processors, F-47 to F-51 history file, A-55 hit time associativity and, C-28 to C-29 average memory access time and, E-93 high-radix division, I-55 to I-58, I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, k-11 history file, A-55 hist under miss optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M, D., 247, K-54 historical perspectives, K-1 to K-67 history caches in, K-50 no vector processors, F-47 to F-51 history file, A-55 hist time measures, K-6 to K-7 on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 
K-56 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 K-56 index of the A-47 la-64 K-56 index of K-7 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-65 to K-53 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-64 K-56 index on vector processors, F-47 to F-51 la-65 lass on vector processors, F-47 to F-51 la-65 lass on vector processors, F-47 to F-51 procesors in development of, J-83 to J-89 dynamic scheduling in, 92 har | • | | HyperTransport, E-63 | | E-64 on quantitative performance virtual channels and, E-93 measures, K-6 to K-7 on vector processors, F-47 to F-51 history file, A-55 hit time associativity and, C-28 to C-29 average memory access time and, E-93, K-12 to K-13 associativity and, C-28 to C-29 average memory access time and, C-15 during cache indexing, 291–292. C-36 to C-38, C-37 cache size and, 296, 399 hit under miss optimization, I-55, I-56, I-57 higher-radix division, I-55 to I-58, I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 histograms, 382–383 historical perspectives, K-1 to K-67 on vector processors, F-47 to K-7 in measures, K-6 to K-7 on vector processors, F-47 to F-51 la-64. See Intel IA-64 in IA-64. See Intel IA-64. See Intel IA-64. See Intel IA-64. See Intel IA-64 in IA-64. 
See Intel | | <del>-</del> | | | virtual channels and, E-93 heaps, in compilers, B-27 to B-28 heat dissipation, 19 helical scan, K-59 helical scan, K-59 help processor, K-26 hewlett-Packard PA-RISC. See PA-RISC High Productivity Computing Systems (HPCS), F-51 higher-radix division, I-55 to I-58, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 history file, A-55 hit intime associativity and, C-28 to C-29 ascociativity and, C-28 to C-29 ascociativity and, C-28 to C-29 ascociativity and, C-28 to C-29 ascociativity and, C-28 to C-29 ascociativity and, C-28 to C-29 lib associativity and, C-28 to C-29 ascociativity and, C-28 to C-29 lib associativity and, C-28 to C-29 lib associativi | | | 1 | | heaps, in compilers, B-27 to B-28 heat dissipation, 19 history file, A-55 hit time Hennessy, J., K-12 to K-13 associativity and, C-28 to C-29 average memory access time and, C-15 during cache indexing, 291–292, C-36 to C-38, C-37 cache size and, 293–295, 294 defined, 290 trace caches and, 296, 309 hit under miss optimization, 296–298, L-49, L-49 architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 history file, A-55 hit time Acceptable A-Simple Productivity and, C-28 to C-29 average memory access time and, C-15 during cache indexing, 291–292, C-36 to C-38, C-37 cache size and, 293–295, 294 defined, 290 trace caches and, 296, 309 hit under miss optimization, 296–298, L-49, L-49 hit under miss optimization, 296–298, B-28, B-39 to B-43, B-45, K-11 hoagland, Al. 
357 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 hillis, Danny, K-38 histograms, 382–383 histograms, 382–383 histograms, 382–383 historical perspectives, K-1 to K-67 hit time associativity and, C-28 to C-29 asercated, C-15 IBM 360, J-83 to J-89 architecture, i.J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 hardware-based speculation in, 171 instruction usage in programs, J-89 hardware-based speculation in, L-55, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, Hoagland, Al. 357 hone nodes, 232, 233 | | | IA-32 microprocessors, A-46 to A-47 | | heat dissipation, 19 helical scan, K-59 helical scan, K-59 hit time associativity and, C-28 to C-29 HEP processor, K-26 HEP processor, K-26 HEP processor, K-26 HEP productivity Computing Systems (HPCS). F-51 higher-radix division, I-55 to I-58, I-55, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 histograms, 382–383 historical perspectives, K-1 to K-67 high energy access time and, C-28 to C-29 average memory access time and, C-15 associativity and, C-28 to C-29 associativity and, C-28 to C-29 associativity and, C-28 to C-29 average memory access time and, C-15 C-15 average memory access time and, C-26 to C-38, C-37 cache size and, 291–292. C-36 to C-38, C-37 cache size and, 293–295, 294 defined, 290 trace caches and, 296, 309 hit under miss optimization, 296–298, Litachi S810/S820, F-7, F-34, F-49 Hitachi S810/S820, F-7, F-34, F-49 Hitachi S810/S820, F-7, F-34, F-49 Hitachi S810/S820, F-7, F-34, F-49 Hitachi SuperH. 
See SuperH HLLCA (high-level language computer architecture), B-26, K-53 memory acches in, K-53 memory acches in, K-53 memory acches in, K-53 memory acches in, S-53 lbM 360, J-83 to J-89 hardware-based speculation in, K-52 virtual memory in, K-52 virtual memory in, K-53 lbM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory protection in, K-52 virtual memory in, K-53 lbM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-89 hardware-based speculation in, K-52 virtual memory in, K-53 lbM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-89 hardware-based speculation in, M-52 virtual memory in, K-53 lbM 360/91 dynamic scheduling in, 92 hardware-based speculation in, | , | | IA-64. See Intel IA-64 | | helical scan, K-59 Hennessy, J., K-12 to K-13 HEP processor, K-26 Hewlett-Packard PA-RISC. See PA-RISC High Productivity Computing Systems (HPCS), F-51 higher-radix division, I-55 to I-58, I-55, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-evel optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 hillis, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 hit time associativity and, C-28 to C-29 associativity and, C-28 to C-29 associativity and, C-28 to C-29 associativity and, C-28 to C-29 associativity and, C-28 to C-29 average memory access time and, C-15 during cache indexing, 291–292. C-36 to C-38, C-37 cache size and, 293–295, 294 defined, 290 trace caches and, 296, 309 hit under miss optimization, 296–298, 297 Litachi SuperH. See SuperH HLLCA (high-level language computer architecture), B-26, B-28, B-39 to B-43, B-45, K-11 Hoagland, Al, 357 HOL. 
See head-of-line (HOL) blocking home nodes, 232, 233 historical perspectives, K-1 to K-67 hit time E-44, E-56 IBM 360, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 architecture in, J-2, J-42, K-10 development of, J-83 to J-89 frequency of instruction usage in programs, J-89 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory caches in, K-53 innovations in, K-20 memory caches in, K-53 | | | IAS computer, K-3 | | helical scan, K-59 Hennessy, J., K-12 to K-13 HEP processor, K-26 HEP processor, K-26 Hewlett-Packard PA-RISC. See PA-RISC High Productivity Computing Systems (HPCS), F-51 higher-radix division, I-55 to I-58, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 hill, M. D., 247, K-54 hillis, Danny, K-38 histograms, 382–383 histograms, 382–383 historical perspectives, K-1 to K-67 hit uime associativity and, C-28 to C-29 average memory access time and, C-15 development of, J-83 to J-88 dynamic scheduling in, 92 frequency of instruction usage in programs, J-89 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory caches in, K-53 innovations in, K-20 memory caches in, K-53 | heat dissipation, 19 | history file, A-55 | IBM 3ASC Purple pSeries 575, E-20, | | HEP processor, K-26 average memory access time and, Hewlett-Packard PA-RISC. See PA-RISC during cache indexing, 291–292. High Productivity Computing Systems (HPCS). 
F-51 cache size and, 293–295, 294 defined, 290 trace caches and, 296, 309 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 computer sight-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 HOL. See head-of-line (HOL) histograms, 382–383 historical perspectives, K-1 to K-67 Age architecture, M-26, Holes and programs average memory access time and, architecture in, J-2, J-42, K-10 development of, J-83 to J-84 dynamic scheduling in, 92 frequency of instruction usage in programs, J-89 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 in M-360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory protection in, K-52 virtual memory in, K-53 in M-360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 in M-360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 innovations in, K-20 memory caches in, K-53 in M-360/91 innovations in, K-20 memory caches in, K-53 | helical scan, K-59 | hit time | _ <del>-</del> | | HEP processor, K-26 Hewlett-Packard PA-RISC. See PA-RISC High Productivity Computing Systems (HPCS), F-51 higher-radix division, I-55 to I-58, I-55, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 hill, M. 
D., 247, K-54 histograms, 382–383 382–384 histograms, 382–384 histograms, 382–384 histograms, 382–385 histograms, | Hennessy, J., K-12 to K-13 | associativity and, C-28 to C-29 | IBM 360, J-83 to J-89 | | Hewlett-Packard PA-RISC. See PA-RISC High Productivity Computing Systems (HPCS). F-51 cache size and, 293–295, 294 higher-radix division, I-55 to I-58, I-55, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 Hill, M. D., 247, K-54 Hills, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 High Productivity Computing Systems (C-36 to C-38, C-37 cache size and, 293–295, 294 defined, 290 trace caches and, 296, 309 hit under miss optimization, 296–298, leffined, 290 trace caches and, 296, 309 hit under miss optimization, 296–298, litachi SuperH. See SuperH Hitachi S810/S820, F-7, F-34, F-49 Hitachi SuperH. See SuperH HLLCA (high-level language computer architecture), B-26, B-28, B-39 to B-43, B-45, locking home nodes, 232, 233 historical perspectives, K-1 to K-67 Honeywell Bull, K-61 development of, J-83 to J-84 dynamic scheduling in, 92 frequency of instruction usage in programs, J-89 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 liBM 360/91 dynamic scheduling in, 92 frequency of instruction usage in programs, J-89 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-86 memory addressing in, B-8, K-9, K-51 Holl, M. D., 247, K-54 HOL. 
See head-of-line (HOL) home nodes, 232, 233 historical perspectives, K-1 to K-67 Honeywell Bull, K-61 | HEP processor, K-26 | average memory access time and, | | | PA-RISC High Productivity Computing Systems (HPCS). F-51 higher-radix division, I-55 to I-58, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 high Productivity Computing Systems (C-36 to C-38, C-37 cache size and, 293–295, 294 defined, 290 hardware-based speculation in, 171 hit under miss optimization, 296–298, 1785, J-86, J-87, J-88 hit under miss optimization, 296–298, 1785, J-86, J-87, J-88 Hitachi S810/S820, F-7, F-34, F-49 Hitachi SuperH. See SuperH HLLCA (high-level language memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 high-order functions, register indirect jumps for, B-18 HOL. See head-of-line (HOL) hillis, Danny, K-38 historical perspectives, K-1 to K-67 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 HOL. See head-of-line (HOL) home nodes, 232, 233 historical perspectives, K-1 to K-67 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 HOL. See head-of-line (HOL) home nodes, 232, 233 historical perspectives, K-1 to K-67 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 HOL. See head-of-line (HOL) home nodes, 232, 233 historical perspectives, K-1 to K-67 home nodes, 232, 233 historical perspectives, K-1 to K-67 | Hewlett-Packard PA-RISC. 
See | C-15 | | | High Productivity Computing Systems (HPCS), F-51 higher-radix division, I-55 to I-58, I-55, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 Hill, M. D., 247, K-54 high-level perspectives, K-1 to K-67 High Productivity Computing Systems (C-36 to C-38, C-37 cache size and, 293–295, 294 defined, 290 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, J-85, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 innovations in, K-20 memory caches in, K-53 | PA-RISC | during cache indexing, 291-292, | • | | higher-radix division, I-55 to I-58, I-55, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 Hills, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 cache size and, 293–295, 294 defined, 290 hardware-based speculation in, 171 innstruction sets in, J-85 to J-88, I-77, F-34, F-49 Hitachi S810/S820, F-7, F-34, F-49 Hitachi SuperH. See SuperH HLLCA (high-level language computer architecture), B-26, B-28, B-39 to B-43, B-45, K-11 Hoagland, Al, 357 HOL. 
See head-of-line (HOL) blocking home nodes, 232, 233 historical perspectives, K-1 to K-67 honeywell Bull, K-61 programs, J-89 hardware-based speculation in, 171 instruction sets in, J-85 to J-88, I-78, J-85, J-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 innovations in, K-20 memory caches in, K-53 | High Productivity Computing Systems | C-36 to C-38, C-37 | | | higher-radix division, I-55 to I-58, I-55, I-56, I-57 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 Hills, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 high-level division, I-55 to I-58, trace caches and, 296, 309 hit under miss optimization, 296–298, trace caches and, 296, 309 hit under miss optimization, 296–298, I-49, I-49 Hit under miss optimization, 296–298, I-49, I-49 Hitachi S810/S820, F-7, F-34, F-49 Hitachi SuperH. See SuperH HLLCA (high-level language computer architecture), B-26, B-28, B-39 to B-43, B-45, K-11 Hoagland, Al, 357 HOL. 
See head-of-line (HOL) blocking home nodes, 232, 233 historical perspectives, K-1 to K-67 hit under miss optimization, 296–298, I-71 instruction sets in, J-85 to J-88, I-71 instruction sets in, J-85 to J-88, I-71 Instruction sets in, J-85 to J-88, I-71 Instruction sets in, J-85 to J-88, I-75, I-86, J-87, J-88 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, I71 innovations in, K-20 memory caches in, K-53 | (HPCS), F-51 | cache size and, 293-295, 294 | - · · · · · · · · · · · · · · · · · · · | | I-55, I-56, I-57 trace caches and, 296, 309 higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 Hills, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 high-level optimization, I-48 to hit under miss optimization, 296–298, high-level language computer architecture), B-26, hit under miss optimization, 296, B-8, K-9, high-level language computer architecture), B-26, hit under miss | higher-radix division, I-55 to I-58, | defined, 290 | | | higher-radix multiplication, I-48 to I-49, I-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hoagland, Al, 357 Hill, M. D., 247, K-54 Hills, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 high-level optimization, I-48 to I-49, I-49 297 Hitachi S810/S820, F-7, F-34, F-49 Hitachi SuperH. 
See SuperH K-53 Memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, I71 histograms, 382–383 historical perspectives, K-1 to K-67 high-level language Computer architecture), B-26, Mitachi S810/S820, F-7, F-34, F-49 Hitachi S810/S820, F-7, F-34, F-49 Hitachi S810/S820, F-7, F-34, F-49 Hitachi S910/S820, S910/S920, F-7, F-34, F-49 Hitachi S910/S920, F-7, F-34, F-49 HItachi S910/S920, F-7, F-34, F-49 HItachi S910/S920, F-7, F-34, F-49 HItachi S910/S920, F-7, F-34, F-49 HItachi S910/S920, F-7, F-34, F- | | trace caches and, 296, 309 | | | 1-49, 1-49 high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 Hills, Danny, K-38 high-order spectives, K-1 to K-67 Hills, Danny, K-38 high-level language birth size (HLLCA), B-26, Hitachi SuperH. See SuperH K-53 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory protection in, K-52 wirtual memory in, K-53 high-order functions, register indirect jumps for, B-18 Hoagland, Al, 357 HOL. See head-of-line (HOL) hardware-based speculation in, blocking 171 histograms, 382–383 historical perspectives, K-1 to K-67 Honeywell Bull, K-61 memory caches in, K-53 memory protection in, K-52 wirtual memory in, K-53 high-order (HOL) hardware-based speculation in, 171 minovations in, K-20 memory caches in, K-53 | higher-radix multiplication, I-48 to | | | | high-level language computer architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hill, M. D., 247, K-54 Hills, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 Hitachi S810/S820, F-7, F-34, F-49 Hitachi SuperH. 
See SuperH K-53 memory addressing in, B-8, K-9, K-53 memory caches in, K-53 memory rotection in, K-52 virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 histograms, 382–383 home nodes, 232, 233 historical perspectives, K-1 to K-67 Honeywell Bull, K-61 memory addressing in, B-8, K-9, M-53 memory addressing in, B-8, K-9, M-53 memory addressing in, B-8, K-9, M-53 memory addressing in, B-8, K-9, M-53 memory caches in, K-53 | | <del>-</del> | | | architecture (HLLCA), B-26, B-28, B-39 to B-43, B-45, K-11 high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 HOLS See head-of-line (HOL) Hillis, Danny, K-38 histograms, 382–383 historical perspectives, K-1 to K-67 Hitachi SuperH. See SuperH K-53 memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 histograms, 382–383 home nodes, 232, 233 historical perspectives, K-1 to K-67 Hitachi SuperH. See SuperH K-53 memory caches in, K-53 memory caches in, K-53 historical perspectives, K-1 to K-67 Honeywell Bull, K-61 memory caches in, K-53 | high-level language computer | Hitachi S810/S820, F-7, F-34, F-49 | | | B-28, B-39 to B-43, B-45, K-11 computer architecture), B-26, high-level optimizations, B-26, B-28 high-order functions, register indirect jumps for, B-18 Hoagland, Al, 357 Hill, M. D., 247, K-54 HOL. 
See head-of-line (HOL) histograms, 382–383 historical perspectives, K-1 to K-67 HLLCA (high-level language memory caches in, K-53 memory protection in, K-52 virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 histograms, 382–383 home nodes, 232, 233 historical perspectives, K-1 to K-67 HOLDAGA B-26, B-28, B-39 to B-43, B-45, virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 himovations in, K-20 memory caches in, K-53 | | | | | K-11 computer architecture), B-26, high-level optimizations, B-26, <b>B-28</b> high-order functions, register indirect jumps for, B-18 Hoagland, Al, 357 Holl, M. D., 247, K-54 HOL. See head-of-line (HOL) hardware-based speculation in, Hillis, Danny, K-38 home nodes, 232, 233 historical perspectives, K-1 to K-67 Honeywell Bull, K-61 memory protection in, K-52 virtual memory in, K-53 IBM 360/91 dynamic scheduling in, 92 hardware-based speculation in, 171 innovations in, K-20 memory caches in, K-53 | <b>B-28,</b> B-39 to B-43, B-45, | | | | high-level optimizations, B-26, <b>B-28</b> high-order functions, register indirect jumps for, B-18 Hoagland, Al, 357 Hill, M. D., 247, K-54 HOL. See head-of-line (HOL) histograms, 382–383 historical perspectives, K-1 to K-67 high-order functions, register indirect K-11 Hoagland, Al, 357 Hoagland, Al, 357 Hobel hour (HOL) hardware-based speculation in, 171 histograms, 382–383 home nodes, 232, 233 historical perspectives, K-1 to K-67 honeywell Bull, K-61 memory caches in, K-53 | | | • | | high-order functions, register indirect jumps for, B-18 Hoagland, Al, 357 Hill, M. D., 247, K-54 HOL. See head-of-line (HOL) histograms, 382–383 historical perspectives, K-1 to K-67 Hogeland, Al, 357 | high-level optimizations, B-26, B-28 | | · - | | jumps for, B-18 Hoagland, Al, 357 dynamic scheduling in, 92 Hill, M. D., 247, K-54 HOL. 
#### I-16 Index

| IBM 370 | tournament predictors in, 88 | injection bandwidth, E-18, E-26, E-41, |
|---|---|---|
| development of, J-84 | virtual registers in, 162 | E-55, E-63 |
| floating point formats in, I-63 to | IBM PowerPC. See PowerPC | Inktomi search engine, K-42 |
| I-64 | IBM RP3, K-40 | input-buffered switch, E-57, E-59, |
| guest page tables in, 320 | IBM RS 6000, K-13 | <b>E-59,</b> E-62, E-73 |
| memory addressing in, B-8 | IBM SAGE, K-63 | input-output-buffered switch, E-57, |
| vector length control in, F-18 | IBM zSeries, F-49 | E-57, E-60, E-61, E-62 |
| virtualization in, 319 | IBM/Motorola PowerPC. See | instruction buffers, A-15 |
| IBM 701, K-3, K-5 | PowerPC | instruction cache misses, 227, 227, |
| IBM 801, K-12 to K-13 | IC (instruction count), 42–44, C-4 | 329, <b>C-15</b> |
| IBM 3033 cache, C-38 | ID. See instruction decode/register | instruction commit, 105–106, |
| IBM 3090/VF, <b>F-34</b> | fetch cycle | 108–110, <b>110</b> |
| IBM 3480 cartridge, K-59 | ideal pipeline CPI, 66, 67 | instruction count (IC), 42–44, C-4 |
| IBM 7030, K-18 | IDE/ATA disks, 368, 369 | instruction decode, 90 |
| IBM ASCI White SP Power3, E-20, | identifier fields, E-84 | instruction decode/register fetch cycle |
| E-44, E-56 | Idle Control Register (ICR), D-8 | (ID) |
| IBM Blue Gene/L | IEEE Computer Society Technical | branch hazards and, A-21 |
| architecture of, E-20, H-41 to | Committee on Fault | in RISC instruction set, A-5 to |
| H-42 | Tolerance, 366 | A-6 |
| computing node in, H-42 to H-44, | IEEE standard for floating-point | in unpipelined MIPS |
| H-43, H-44 | arithmetic | implementation, A-26 to |
| custom clusters in, 198 | advantages of, I-13 to I-14 | A-27, <b>A-29</b> |
| development of, K-43 | exceptions in, I-34 to I-35 | instruction delivery, 121–127, 122, |
| interconnection networks in, E-72 | format parameters, I-16 | 124, 126 |
| to E-74 | remainders, I-31 to I-32 | instruction fetch cycle (IF) |
| routing, arbitration, and switching | rounding modes, I-20 | branch hazards and, A-21 |
| characteristics in, E-56 | standardization process, I-64 | in RISC instruction set, A-5, A-15 |
| topological characteristics of, | underflow in, I-36 | in unpipelined MIPS |
| E-44 | if conversion, G-24 | implementation, A-26 to |
| IBM CodePack, B-23 | Illiac IV, K-35 | A-27, <b>A-29</b> |
| IBM eServer p5 595, <b>47, 48, 49</b> | ILP. See instruction-level parallelism | instruction fetch units, 126–127 |
| IBM eServer pSeries 690, <b>47, 48, 49</b> | immediate addressing mode, B-9, | instruction formats, J-7, J-8 |
| IBM mainframes, virtualization in, | B-10 to B-13, <b>B-11</b>, <b>B-12</b>, | instruction groups, G-34 to G-35, |
| 324–325 | B-13 | G-35 |
| IBM Power2 processor, 130 | IMPACT, K-24 | instruction issue, A-33, A-67 to A-70, |
| IBM Power4 processor, <b>52</b>, K-27 | implementation, in design, 8 | A-68, A-71 |
| IBM Power5 processor | implicit parallelism, G-34 | instruction length, J-60, J-60. See also |
| clock rate on, 139, <b>139</b> | in flight instructions, 156 | instruction count |
| eServer p5 575, 178 | inclusion property, 211 | instruction packets, in embedded |
| instruction-level parallelism in, | index field, C-8 to C-9, C-8 | systems, D-9, <b>D-10</b> |
| 156 | index vectors, F-27, F-28 | instruction path length, 42–44, C-4 |
| memory hierarchy of, 341 | indexed addressing mode, B-9, J-67 | instruction set architecture (ISA), |
| multicore, 198 | indexing. See cache indexing | 8–12, B-1 to B-47. See also |
| on-chip networks, E-73 | indirect networks, E-31, E-48, E-67 | RISC architectures |
| performance on SPEC | inexact exceptions, I-35 | addressing modes and, B-9 to |
| benchmarks, 255–257, <b>255</b>, | InfiniBand, E-4, E-64, E-74 to E-77, | B-10, B-9, B-11 to B-13 |
| 256, 257, G-43 | E-75, E-76, E-102 | classification of, B-3 to B-7 |
| simultaneous multithreading in.
| infinite population model, 386 | compiler role in, B-24 to B-32 | | K-27 | initiation intervals, A-48 to A-49, | compiler technology and, B-27 to | | SMT performance on, 176–181, | A-49, A-62 | B-29 | | 178 to 181 | initiation rate, F-10 | | | conditional branch operations, | data hazards and, 71–72 defined, 66 | carry-lookahead adders, I-37 to I-41, I-38, I-40, I-41, I-42, | |----------------------------------------------------|----------------------------------------|---------------------------------------------------------------| | B-19, <b>B-19, B-20</b> | development of, K-24 to K-25 | I-44 | | in Cray X1, F-41<br>defined, 8–9 | in embedded systems, D-8 | carry-select adders, I-43 to I-44, | | encoding, B-21 to B-24, <b>B-22</b> | hardware vs. software approach | I-43, I-44 | | flaws in, B-44 to B-45 | to, 66 | carry-skip adders, I-41 to I-43, | | hardware, 9–10, <b>9</b> | instruction fetch bandwidth | I-42, I-44 | | high-level language structure and, | increases in, 121–127, <b>122</b>, | conversions to floating-point, I-62 | | B-39 to B-43, B-45 | 124, 126 | faster division with one adder, | | historical perspectives on, B-3, | limitations in ideal processors, | I-54 to I-58, <b>I-55</b>, <b>I-56</b>, <b>I-57</b> | | B-45 to B-46, K-9 to K-15 | 154–165, 159, 160, 162, 163, | faster multiplication with many | | in IBM 360/370 mainframe, J-83 | 164 | adders, I-50 to I-54, I-50 to | | to J-89 | limitations in realizable | I-54 | | instructions for control flow, B-16 | processors, 165–170, 166, | faster multiplication with single | | to B-21 | 181–184 | adders, I-47 to I-50, I-48, | | integer arithmetic issues in, I-10 | loop unrolling and, 75–80, 75, | I-49 | | to I-13, I-11, I-12, I-13 | 117–118 | in Intel 80x86, J-50 to J-52, <b>J-52</b>, | | in Intel 80x86, J-45 to J-65 | loop-level parallelism, 67–68 | J-53 | | in Intel IA-64, G-32 to G-40, | multiple issue and speculation | radix-2 multiplication and | | G-35, G-36, G-37, G-39 | examples, 118–121,
120, 121 | division, I-4 to I-7, I-4, I-6, | | measurements, B-2 | overview of, 67–68 | 1-55 to 1-56, <b>I-55</b> | | memory addressing in, B-7 to | pipeline scheduling and, 75-79 | ripple-carry addition, I-2 to I-3, | | B-13, B-8, B-9, B-11, B-12, | processor comparisons for, | 1-3, I-42, I-44 | | B-13 | 179-181, <b>179, 180, 181</b> | shifting over zeros technique, I-45 | | memory protection and, 324-325 | in RISC-based machines, 2 | to I-47, <b>I-46</b> | | multimedia instructions, B-31 to | in simultaneous multithreading, | signed numbers, I-7 to I-10 | | B-32 | 173, 175 | speeding up addition, I-37 to I-44 | | operand type and size, B-13 to | SPEC92 benchmarks and, 156. | speeding up multiplication and | | B-14, <b>B-15</b> | 157 | division, I-44 to I-58 | | operations in the instruction set, | switch from, to TLP and DLP, 4 | systems issues, I-10 to I-13, I-11, | | B-14 to B-16, <b>B-15</b> , <b>B-16</b> | thread-level parallelism vs., 172 | I-12 | | organization of, 12 | value prediction in, 130 | integer registers, B-34 | | orthogonal, B-30, J-83, K-11 | instruction-level parallelism | integrated circuits | | procedure invocation options, | limitations, 154–165 | costs of, 21–25, <b>22, 23</b> | | B-19 to B-20 | data flow limit, 170 | dependability of, 25–28 | | reducing code size and, B-43, | finite registers and, 162–164, 163 | feature size in, 17 | | B-44 | hardware model, 154–156 | prices of, 25–29 | | variations in, B-43, <b>B-44</b> | imperfect alias analysis and, | technology growth in, 14 | | in VAX, J-65 to J-83 | 164–165, <b>164</b> | trends in power in, 17–19 | | vectorized, F-3, F-4 to F-6 | realistic branch and jump | wire delay in, 17 | | in virtual machines, 317, 319–320 | prediction effects, 160–162 | integrated instruction fetch units, | | instruction set complications, A-45 to | for realizable processors, | 126–127<br>Intel 80x86, J-45 to J-65 | | A-47 | 165–169, <b>166</b> | comparative operation | | instruction-level parallelism (ILP), | 
unnecessary dependences, | measurements, J-62 to J-64, | | 66–141. See also | 169–170 | J-63, J-64 | | instruction-level parallelism | WAR and WAW hazards through | conditional branch options in, | | limitations; pipelining | memory, 169 on window size and maximum | B-19 | | branch prediction buffers and, | issue count, 156–160, | development of, J-45 to J-46, J-64 | | 82–86, <b>83, 84, 85</b> | 166–167, <b>166</b> | to J-65 | | compiler techniques for exposing, 74–80, <b>75</b> | instructions per clock (IPC), 42, 253 | exceptions in, A-40 | | data dependences and, 68–70 | integer arithmetic | | | Intel 80x86, J-45 to J-65 (continued) | functional units and instruction | interarrival times, 386 | |----------------------------------------------------------------|-----------------------------------------------------------------------|----------------------------------------------------------------| | floating point operations, J-52 to | issue in, G-41 to G-43, G-41 | interconnection networks.
See also | | J-55, <b>J-54, J-61</b> | historical perspectives on, G-44 | clusters; networks | | guest OS in, 324 | Itanium 1 compared to, G-40 to | arbitration in, E-49 to E-50, E-49, | | instruction encoding, J-55 to J-56, | G-41 | E-56 (See also arbitration) | | J-56, J-57, J-58 | peak performance in, 52 | asynchronous transfer mode (See | | instruction set architecture in, | performance measurements of, | asynchronous transfer mode) | | 9-10, B-44 to B-45, <b>J-42</b> | 179–181, <b>179, 180, 181,</b> | buffer organizations, E-58 to E-60 | | integer operations in, J-50 to J-52, | G-43, <b>G-43</b> | centralized switched, E-30 to | | J-52, J-53 | SPECRatios for, 35, 37 | E-34, E-31, E-33, E-48 | | memory addressing in, B-8, C-56 | Sun T1 compared with, 253 | characteristics of, E-20, E-44, | | memory protection in, C-49, K-52 | Intel MMX, B-31 to B-32, J-46 | E-56 | | operand addressing | Intel Paragon, K-40 | composing and processing | | measurements, J-59 to J-62, | Intel Pentium | messages, E-6 to E-9, E-7 | | J-59 to J-62 | precise exceptions in, A-56 | compute-optimized, E-88 | | Pacifica revision to, 320, 339 | protection in, C-48, C-49 to C-52, | conceptual illustration of, E-3 | | registers and data addressing | C-51, C-55 | congestion management, E-64 to | | modes, J-47 to J-49, <b>J-48</b> , | register renaming in, 128 | E-66 | | J-49, J-50 | Intel Pentium 4, 131–138 | connectivity in, E-62 to E-63 | | RISC instruction sets in, B-3, J-45 | AMD Opteron compared to, | density-optimized vs. 
| | to J-65 | 136–138, <b>137, 138,</b> 334–335, | SPEC-optimized processors, | | top ten instructions in, <b>B-16</b> | 334 | E-85 | | variable instruction encoding in, | clock rate on, 139–141, <b>139</b> | distributed switched, E-34 to | | B-22 to B-23 | hardware prefetching in, 305, <b>306</b> | E-39, E- <b>36, E-37, E-40</b> , E-46 | | virtualization and, 320, 321, 339, | memory hierarchy of, 341 | in distributed-memory | | 340 | microarchitecture of, 131–132, | multiprocessors, 232 | | Intel 387, I-33 | 132, 133 | domains, E-3 to E-5, E-3, E-4 | | Intel ASCI Red Paragon, E-20, E-44, | multilevel inclusion in, C-34 | Element Interconnect Bus, E-3, | | E-56 | performance analysis of, | E-70, E-71 | | Intel IA-32 microprocessors, A-46 to | 133–138, <b>134 to 138, G-43</b> | Ethernet (See Ethernet) | | A-47, C-49 to C-51, <b>C-51</b> | prices of, <b>20</b> | fault tolerance in, E-66 to E-68, | | Intel IA-64 | signal propagation in, 17 | E-69, E-94 | | EPIC approach in, 118, G-33, | SRT division algorithm in, I-56, | historical perspectives on, E-97 to | | K-24 | I-57 | E-104 | | hardware-based speculation in, | tournament predictors in, 88 | IBM Blue Gene/L eServer, E-20, | | 171 | trace caches in, 296 | E-44, E-56, E-72 to E-74 | | historical perspectives on, G-44, | way prediction in, 295 | InfiniBand, E-4, E-64, E-74 to | | K-14 to K-15 | Intel Pentium 4 Extreme | E-77, E- <b>75, E-76</b> | | ILP performance limitations and, | power efficiency in, 18, 183 | internetworking, E-2, E-80 to | | 184 | SMT performance in, 177, | E-84, E-80 to E-84 | | implicit and explicit parallelism | 179–181, <b>179, 180, 181</b> | | | in, G-34 to G-37, <b>G-35</b> , | | I/O subsystem in, E-90 to E-91 latency and effective bandwidth | | G-36, G-37 | Intel Pentium 4 Xeon, 215 Intel Pentium D, 198, 255–257, <b>255</b> , | <u> </u> | | | | in, E-12 to E-20, E-13, E-19, | | instruction formats in, G-38, <b>G-39</b> page tables in, C-43 | 256, 257 | E-25 to E-29, <b>E-27</b> , 
<b>E-28</b> | | | Intel Pentium III, 183 | memory hierarchy interface | | register model in, G-33 to G-34 | Intel Pentium M, 20 | efficiency, E-87 to E-88 | | speculation support in, G-38, | Intel Pentium MMX, D-11 | performance of, E-40 to E-44, | | G-40 | Intel Thunder Tiger4, E-20, E-44, | <b>E-44,</b> E-52 to E-55, <b>E-54,</b> | | Intel iPSC 860, K-40 | E-56 | E-88 to E-92 | | Intel Itanium 1, G-40 to G-41, K-14 | Intel VT-x, 339–340 | protection and user access, E-86 | | Intel Itanium 2. See also Intel IA-64 | intelligent devices, K-62 | to E-87, <b>E-87</b> | | routing in, E-45 to E-48, E-46, | Internet Archive cluster, 392–397, | kernel process, 316 | |-----------------------------------------------|----------------------------------------------------------|----------------------------------------------------------| | E-52 to E-55, <b>E-54, E-56</b> | 394 | kernels<br>defined, 29 | | shared-media, E-21 to E-24, <b>E-22</b> , | in multiprogramming workload, | FFT, H-21 to H-29, <b>H-23 to H-26</b> , | | E-24 to E-25, E-78 | 225–227, <b>227</b> | H-28 to H-32 | | smart switches vs. smart interface | NetApp FAS6000 filer, 397–398 | Livermore FORTRAN, K-6 | | cards, E-85 to E-86, <b>E-86</b> | queuing theory, 379–382, <b>379</b> , | LU, <b>H-11</b> , H-21 to H-26, <b>H-23</b> to | | speed of, E-88 to E-89 | 381 | H-26, H-28 to H-32 | | standardization in, E-63 to E-64 | throughput vs. response time, | miss rate in multiprogramming | | structure of, E-9 to E-12 | 372–374, <b>373</b> , <b>374</b> | example, 227–228, <b>229</b> | | switch microarchitecture, E-55 to | transaction-processing | Kilburn, T., C-38, K-52 | | E-58, E-56, E-57, E-62 | benchmarks, 374–375, 375 | Kroft, D., K-54 | | switched-media, E-21, E-24, E-25 | virtual caches and, C-37 | Kuck, D., F-51, K-23 | | switching technique in, E-50 to | virtual machines and, 320–321, 339 | Kung, H. T., I-63 | | E-52, <b>E-56</b> | | Kung, 11. 
1., 1-03 | | in symmetric shared-memory | write merging and, 301 | | | multiprocessors, 216–217, | I/O per second (IOPS), 395–396 | I I are her Consultant and another | | 216 | IPC (instructions per clock), 42, 253 | L1 cache. See also multilevel caches | | topologies in, E-40 to E-44, E-44 | iPSC 860, K-40 | in AMD Opteron, 327, <b>327</b> , <b>328</b> , | | zero-copy protocols, E-8, E-91 | Irwin, M. J., I-65 ISA. See instruction set architecture | 329, <b>C-55</b> | | interference, <b>D-21</b> , D-22 | | average memory access time and, | | intermittent faults, 367 | issue rates, in multiple-issue processors, 182 | 291 | | internal fragmentation, C-46 | issue slots, MLT and, 174–175, <b>174</b> | memory hierarchy and, 292 | | International Mobile Telephony 2000 | Itanium 1, G-40 to G-41, K-14 | miss penalties and, 291<br>multilevel inclusion and, 248 | | (IMT-2000), D-25 | * | | | Internet, E-81 | Itanium 2. See Intel Itanium 2 iterative arbiter, E-50 | size of, 294 | | Internet Archive, 392–397, <b>394</b> | | in Sun T1, 251, <b>251</b> | | Internet Protocol, E-81 | iterative division, I-27 to I-31, I-28 | in virtual memory, C-46 to C-47, | | internetworking, E-2, E-80 to E-84, | | C-47 L2 cache. See also multilevel caches | | E-80 to E-84. See also | J | in AMD Opteron, 327, <b>327</b> , <b>334</b> , | | interconnection networks | Java bytecodes, K-10 | | | interprocedural analysis, G-10 | Java Virtual Machine (JVM), K-10 | C-55 | | Interrupt Enable (IE) flag, 324 | JBOD, 362 | average memory access time and, 291 | | interrupts. See exceptions | JIT (just in time) Java compilers, K-10 | hardware prefetching and, 305 | | invalid exceptions, I-35 | Johnson, R. B., K-59 | memory hierarchy and, 292 | | invalidate protocols. See write | Jouppi, N. 
P., K-54 | miss penalties and, 291 | | invalidate protocols | JTAG networks, H-42, H-43 | multibanked caches, 299, 309 | | inverted page tables, C-43 | jump prediction, 155, 160–162, <b>160</b> , | multilevel inclusion and, 248 | | I/O, 371–379. See also buses; storage | 162 | size of, 293 | | systems | jumps, in control flow instructions, | speculative execution and, 325 | | asynchronous, 391 | B-16 to B-18, <b>B-17</b> , B-37 to | in Sun T1, 251–252, <b>251</b> | | cache coherence problem, | B-38, <b>B-38</b> | in virtual machines, 323, <b>323</b> | | 325–326 | just in time (JIT) Java compilers, K-10 | in virtual memory, C-46 to C-47, | | dependability benchmarks, 377–379, <b>378</b> | | C-47 | | | K | Lam, M. S., 170 | | design considerations, 392–393 | K6, 294 | lanes, F-6, <b>F-7</b> , F-29 to F-31, <b>F-29</b> , | | device requests, A-40, A-42 | Kahan, W., I-1, I-64 | F-30 | | disk accesses at operating | k-ary n-cubes, E-38 | LANS (local area networks), E-4, E-4, | | systems, 400–401, <b>401</b> | Keller, T. W., F-48 | E-77 to E-79, <b>E-78</b> , E-99 to | | evaluation of, 394–396 | Kendall Square Research KSR-1, | E-100. 
See also | | historical developments in, K-62 | K-41 | interconnection networks | | to K-63 | Kennedy, John F., K-1 | increomeetion networks | | | | | | large-scale multiprocessors, K-40 to K-44 | from remote access | load interlocks, A-33 to A-35, A-34, | |-------------------------------------------|-------------------------------------------------|---------------------------------------------------| | cache coherence implementation, | communication, 203–204 | A-63, <b>A-65</b> | | H-34 to H-41 | in shared-memory | load vector count and update | | classification of, H-44 to H-46, | multiprocessors, H-29<br>transport, E-14 | (VLVCU), F-18 | | H-45 | using speculation to hide, | load-linked (load-locked) instruction,<br>239–240 | | computation-to-communication | 247–248 | loads, advanced, G-40 | | ratios in, H-10 to H-12, H-11 | in vector processors, F-3, F-4, | | | hierarchical relationships in, | F-16, F-31 to F-32, <b>F-31</b> | loads, applied, E-53 | | H-45, <b>H-46</b> | latent errors, 366–367 | load-store architecture, B-3 to B-6, B-4, B-6 | | IBM Blue Gene/L as, H-41 to | law of diminishing returns, Amdahl's | load-store ISAs, 9, F-6, <b>F-7</b> , F-13 to | | H-44, <b>H-43</b> , <b>H-44</b> | Law and, 40 | F-14 | | interprocessor communication in, | LCD (liquid crystal displays), D-19 | local address space, C-50 | | H-3 to H-6 | learning curve, costs and, 19 | local area networks (LANS), E-4, E-4. | | limited buffering in, H-38 to H-40 | least common ancestor, E-48 | E-77 to E-79, E-78, E-99 to | | message-passing vs. | least-recently used (LRU) blocks, C-9, | E-100. See also | | shared-memory | <b>C-10,</b> C-14, C-43 | interconnection networks | | communication in, H-4 to | Leighton, F. 
T., I-65 | local miss rate, C-30 to C-33, C-32 | | H-6 | limit fields, C-50 | local nodes, 232, <b>233</b> | | scientific/technical computing in, | line locking, D-4 | local optimizations, B-26, <b>B-28</b> | | H-6 to H-12, H-11 | linear speedups, 259–260, <b>260</b> | local scheduling, 116 | | synchronization mechanisms in, | link injection bandwidth, E-17 | locks | | H-17 to H-21, H-19, H-21 | link pipelining, E-16, E-92 | queuing, H-18 to H-20 | | synchronization performance | link reception bandwidth, E-17 | spin lock with exponential | | challenges, H-12 to H-16, | link registers, 240, J-32 to J-33 | back-off, H-17 to H-18, <b>H-17</b> | | H-14, H-15, H-16 | link-level flow control, E-58, E-62, | lockup-free (non-blocking) caches, | | latency | E-65, E-72, E-74 | 296–298, <b>297</b> , <b>309</b> , K-54 | | defined, 15, 28, C-2 | Linpack benchmark, F-8 to F-9, F-37 | logical units, 390-391 | | in distributed-memory | to F-38 | logical volumes, 390–391 | | multiprocessors, 201 | Linux | lognormal distribution, 36–37 | | effective bandwidth and, E-25 to | dependability benchmarks, | loop interchange, 302-303 | | E-29, <b>E-27</b> , <b>E-28</b> | 377–379 <b>, 378</b> | loop unrolling | | in Element Interconnect Bus, | Xen VMM compared with, | dependence analysis and, G-8 to | | E-72 | 322–324, <b>322, 323</b> | G-9 | | in floating-point MIPS pipelining, | liquid crystal displays (LCD), D-19 | eliminating dependent | | A-48 to A-49, <b>A-49</b> , <b>A-50</b> , | LISP, J-30 | computations in, G-11 | | <b>A-62,</b> A-65 | literal addressing mode, <b>B-9</b> , B-10 to | global code scheduling and, 116, | | hiding, H-4 | B-13, <b>B-11</b> , <b>B-12</b> , <b>B-13</b> | G-15 to G-23, G-16, G-20, | | improvements in, compared with | Little Endian, B-7, B-34, J-49 | G-22 | | bandwidth, 15, <b>16</b> | Little's Law, 380-381, 385 | hardware-controlled prefetching | | interconnected nodes and, E-27 | livelock, E-45 | and, 306 | | in interconnection networks, E-12 | liveness, 
74 | pipeline scheduling and, 75–80, | | to E-19, <b>E-13</b> , E-25 to E-29, | Livermore FORTRAN Kernels, K-6 | 75 | | E-27 | load and store instructions | recurrences and, G-11 to G-12 | | I/O, 371 | in instruction set architecture | software pipelining as symbolic, | | in Itanium 2, G-41 | classification, B-3, <b>B-4</b> | G-12 to G-15, <b>G-13</b> , <b>G-15</b> | | latency overlap, C-20 | in MIPS architecture, B-36, B-36 | loop-carried dependences, G-3 to G-5 | | main memory, 310 | in RISC instruction set, A-4 | loop-level parallelism | | packet, E-40 to E-41, E-52 to | load buffers, 94–95, <b>94,</b> 97, <b>101,</b> | eliminating dependent | | E-53 | 102–103 | computations, G-10 to G-12 | | in pipelining, 75–79, A-10 | load delays, 2-cycle, A-59, A-59 | finding dependences, G-6 to G-10 | | greatest common divisor test, G-/ | magnetic storage, history of, K-59 to | immediate values, B-13 | |----------------------------------------|---------------------------------------|--------------------------------------------| | increasing ILP through, 67-68 | K-61. See also storage | in instruction set architectures, | | interprocedural analysis, G-10 | systems | 910 | | points-to analysis, G-9 to G-10 | main memory, DRAM chips in, 310 | interpreting addresses, B-7 to B-8, | | recurrences, G-5 to G-10 | Mark-I, K-3 | R-8 | | | | 2 0 | | type of dependences in, G-2 to | MasPar, K-35 | memory alias analysis, 155, 164–165, | | G-6 | massively parallel processors (MPP), | 164 | | loops. 
See also loop unrolling | H-45 | memory banks, in vector processors, | | barrier synchronization in, H-14 | matrix operations, H-7, I-32 to I-33 | F-14 to F-16, <b>F-15</b> | | chaining, F-35, <b>F-35</b> | Mauchly, John, K-2 to K-3, K-5 | memory consistency models, 243-248 | | conditionals in, F-25 to F-26 | maximum transfer unit, E-7, E-76 | coherence and, 207 | | dependence distance in, G-6 | maximum vector length (MVL), F-17, | compiler optimization and, | | execution time of vector loop, | F-17 | 246–247 | | F-35 | Mayer, Milton, E-1 | development of, K-44 | | lossless networks, E-11, E-59, E-65, | McLuhan, Marshall, E-1 | relaxed consistency, 245-246 | | E-68 | Mead, C., I-63 | sequential consistency, 243-244, | | lossy networks, E-11, E-65 | mean time between failures (MTBF), | 247 | | LRU (least-recently used) blocks, C-9, | 26 | synchronized programs and, | | = | | 244–245 | | <b>C-10,</b> C-14, C-43 | mean time to failure (MTTF), 26–27, | | | LU kernels | 51, 362, 396–397 | using speculation to hide latency | | characteristics of, H-8, H-11 | mean time to repair (MTTR), 26–27, | in, 247–248 | | on distributed-memory | 362, 364–365 | memory hierarchy, 287-342, C-1 to | | multiprocessors, H-28 to | media | C-58. See also cache | | H-32 | extensions, D-10 to D-11, <b>D-11</b> | optimizations; virtual | | on symmetric shared-memory | physical network, E-9 | memory | | multiprocessors, H-21 to | shared, E-21, E-23, E-24, E-78 | AMD Opteron data cache | | H-26, H-23 to H-26 | switched, E-21, E-24, E-25 | example, C-12 to C-14, C-13, | | | memory. 
See also caches; virtual | C-15 | | AA | memory | average memory access time, 290 | | M | in embedded computers, 7–8, D-2 | block addressing, 299, <b>299</b> , C-8 to | | M32R | to D-3 | C-9, C-8 | | addressing modes in, J-5 to J-6, | | * | | J-6 | in interconnection networks, | block placement in caches, C-7 to | | architecture overview, J-4 | 216–217, <b>216</b> | C-8, <b>C-7</b> | | common extensions in, J-19 to | in vector processors, F-14 to F-16, | block replacement with cache | | J-24, <b>J-23, J-24</b> | <b>F-15, F-22 to F-23</b> | misses, C-9, C-10, C-14 | | instructions unique to, J-39 to | virtual (See virtual memory) | cache optimization summary, 309, | | J-40 | memory access pipeline cycle, A-44, | 309 | | MIPS core subset in, J-6 to J-16. | <b>A-51,</b> A-52 | cache organization overview, | | J-8, J-9, J-14 to J-17 | memory access/branch completion | 288–293, <b>292</b> | | multiply-accumulate in, J-19, | cycle | cache performance review, C-3 to | | J-20 | in RISC implementation, A-6 | C-6, C-15 to C-21, <b>C-21</b> | | <del>-</del> | in shared-memory | cache size and hit time, 293-295, | | MAC (multiply-accumulate), D-5, | multiprocessors, H-29 to | 294, 309 | | D-8, J-18, <b>J-20</b> | H-30, <b>H-32</b> | compiler optimizations and, | | machine language programmers, J-84 | | 302–305, <b>304, 309</b> | | machine memory, defined, 320 | in unpipelined MIPS | | | Macintosh, memory addressing in, | implementation, A-28, A-29 | compiler-controlled prefetching, | | K-53 | memory accesses per instruction, C-4 | 305–309, <b>309</b> | | magnetic cores, K-4 | to C-6 | critical word first and early restart. | | magnetic disks. 
See disk storage; | memory addressing, B-7 to B-13 | 299–300, <b>309</b> | | RAID | addressing modes, B-9 to B-11, | hardware prefetching and, 305, | | | <b>B-9, B-11, B-12,</b> J-47 | 306, 309 | | | | | | memory hierarchy (continued) | safe calls from user to OS gates, | MIPS (million instructions per | |------------------------------------------------------------|------------------------------------------------------------------------------|--------------------------------------------------------------------------| | historical perspectives on, K-52 to | C-52 | second), 169, A-4 to A-5, | | K-54 | by virtual machines, 317-324, | <b>B-19,</b> K-6 to K-7 | | in IBM Power 5, <b>341</b> | 322, 323 | MIPS 1, <b>J-1</b> , <b>J</b> -6 to J-16, <b>J-7</b> , <b>J-9 to</b> | | in Intel Pentium 4, 341 | by virtual memory, 315-317, | J-13, J-17 | | main memory, 310 | C-39, C-47 to C-52, C-51 | MIPS 16 | | merging write buffers and,<br>300–301, <b>301, 309</b> | memory reference speculation, G-32,<br>G-40 | addressing modes in, J-5 to J-6, <b>J-6</b> | | of microprocessors compared, | memory stall cycles, C-4 to C-6, C-20 to C-21, <b>C-21</b> | architecture overview, <b>J-4</b> common extensions in, J-19 to | | multibanked caches, 298–299, <b>299, 309</b> | memory technology, 310-315. 
See | J-24, J <b>-23, J-24</b> | | * | also storage systems | features added to, J-44 | | multilevel inclusion in, C-34 nonblocking caches, 296–298, | DRAM, 311–315, <b>311, 314</b> flash memory, 359–360 | instructions unique to, J-40 to J-42 | | 297, 309 | SRAM, 311, F-16 | MIPS core subset in, J-6 to J-16, | | operating system impact on, C-56,<br>C-57 | memory-constrained scaling, H-33 to<br>H-34 | J-8, J-9, J-14 to J-17 multiply-accumulate in, J-19, | | pipelined cache access and, 296, | memoryless processes, 384, 386 | J-20 | | 309 | memory-memory architecture, B-3, | reduced code size in, B-23 | | projections of processor | B-5, <b>B-6</b> | MIPS 32, <b>J-80</b> | | performance, 288, <b>289</b> | memory-memory vector processors, | MIPS 64 | | sizes and access times of levels in, | F-4, F-44, F-48 | addressing modes in, J-5 to J-6, | | C-3 | mesh networks, E-36, E-40, E-46; | J-5 | | speculative execution and, 325 | E-46 | common MIPS extensions in, J-19 | | in Sun Niagara, 341 | MESI protocol, 213 | to J-24, <b>J-21 to J-23</b> | | thrashing, C-25 | Message Passing Interface (MPI), H-5 | instruction set architecture, 10, | | trace caches and, 296, 309 | message-passing communication, H-4 | 11, 12, A-4, B-33, B-34, <b>B-40</b> | | typical multilevel, 288 | to H-6 | unique instructions in, J-24 to | | in virtual memory, C-40, C-41, | message-passing multiprocessors, 202 | J-27, J <b>-26</b> | | C-42 to C-44, C-43, C-46,<br>C-47 | messages, E-6 to E-8, E-7, E-76, E-77<br>MFLOPS ratings, F-34 to F-35, F-37, | MIPS architecture, B-32 to B-39 addressing modes for, B-34 to | | way prediction and, 295, 309 | K-7 | B-35 | | writes in caches, C-9 to C-12, | microinstruction execution, pipelining, | ALU instructions, A-4 | | C-10 memory indirect addressing mode, | A-46 to A-47 microprocessors | common extensions in, J-19 to J-24, <b>J-21 to J-24</b> | | B-9, B-11 | costs of, 20, <b>20</b> | control flow instructions, B-37 to | | memory mapping, C-40. 
See also | memory hierarchy in, 341 | B-38, <b>B-38</b> | | address translations | performance milestones in, 15, 16 | data types for, B-34 | | memory pipelines, on VMIPS, F-38 to F-40 | transistor performance improvements in, 17 | in embedded multiprocessors,<br>D-14 to D-15, D-17 | | memory ports, pipeline stalls and, | migration of shared data, 207 | floating-point operations in, B-38 | | A-13, <b>A-14</b>, <b>A-15</b> | MIPS MDMX, J-16 to J-19, <b>J-18</b> | to B-39, <b>B-40</b> | | memory protection, 315–324, C-47 to C-55 | MIMD. See multiple instruction streams, multiple data | instruction format, B-35, <b>B-35</b> instruction set usage, 9–10, B-39, | | in 64-bit Opteron, C-53 to C-55, | streams | B-41, B-42 | | C-54 | minicomputers, beginnings of, 3, 4 | operations supported by, B-35 to | | architecture provisions for, 316 | minimal paths, E-45, E-67 | B-37, <b>B-36, B-37</b> | | development of, K-52 instruction set architecture and, | MIPS (Microprocessor without | processor structure with | | 324–325 | Interlocked Pipeline Stages),<br>J-82, <b>J-82,</b> K-12 | scoreboard, A-68 to A-69, A-68 | | | | |
|-----------------------------------------------|-------------------------------------------|-------------------------------------------| | recommendations for, B-33 | mirroring, 362, <b>363</b>, K-61 to K-62 | in instruction vs. data caches, | | registers for, B-34 | misalignment, address, B-7 to B-8, | C-14, C-15 | | Tomasulo's approach in, 94 | B-8 | local vs. global, C-30 to C-31 | | unpipelined implementation of, | MISD (multiple instruction streams, | main categories of, 290 | | A-26 to A-30, <b>A-29</b> | single data stream), 197 | measurement of, C-4 to C-5 | | vector, F-4 to F-6, <b>F-5, F-7, F-8</b> | misprediction rate | in multilevel caches, C-30 to | | MIPS M2000, J-81, <b>J-82</b>, K-13 to | in Alpha 21264, 89, 140 | C-33, C-32 | | K-14, <b>K-14</b> | from branch-prediction buffers, | process-identifier tags and, C-36, | | MIPS pipeline | 82, <b>83</b>, 84 | C-37 | | basic, A-30 to A-33, A-31, A-32 | for correlating predictors, 86, 87, | misses per instruction, 290, C-5 to | | branch hazards, A-21 | 88 | C-6, C-30 to C-31. See also | | branches in, A-35 to A-37, A-38, | in Pentium 4, 133–134, <b>134</b> | miss rates | | A-39 | in static branch prediction, 81, 81 | Mitchell, David, K-37 | | control implementation in, A-33 | value prediction and, 130 | Mitsubishi M32R.
See M32R | | to A-35, A-34 | miss latency, C-20 | mixed caches, C-14 | | exceptions in, A-38 to A-41, A-40, | miss penalties | M/M/1 queues, 386 | | <b>A-42,</b> A-43 to A-45, <b>A-44</b> | block size and, C-26 | M/M/m multiple-server model, | | floating-point in, A-47 to A-56, | compiler-controlled prefetching, | 388–389, <b>388</b> | | A-48 to A-51, A-57, A-58, | 305–309, <b>309</b> | MMX, B-31 to B-32, J-46 | | A-60 to A-62, <b>A-61</b> to <b>A-63</b> | CPU time and, C-18 | Modula-3, I-12 | | ILP limitations in, 167–169 | critical word first and early restart, | module availability, 26 | | instruction set complications, | 299–300, <b>309</b> | module reliability, 26, 49 | | A-45 to A-47 | equation for, 168 | modulo scheduling, K-23 | | loop unrolling in, 76–79 | hardware prefetching and, 305, | MOESI protocol, 213 | | MIPS R4000 example, A-56 to | 306, 309 | Moore's Law, 312 | | A-65, <b>A-58</b> to <b>A-65</b> | memory stall cycles and, C-4 to | Mosaic, E-98 | | out-of-order executions in, A-66 | C-6 | Motorola 680x0, A-40, J-42, K-53 | | to A-67 | multilevel caches and, 291, C-15 | Motorola 68882, I-33 | | scoreboarding technique in, A-66 | to C-16, C-29 to C-34, C-32 | MPI (Message Passing Interface), H-5 | | to A-75, A-68, A-71 to A-75 | nonblocking caches and, | MPP (massively parallel processors), | | stopping and restarting execution | 296–298, <b>297, 309</b> | H-45 | | in, A-41 to A-43 | in out-of-order processors, C-19 | MSP (Multi-Streaming Processors), | | MIPS R1000, 247 | to C-21, C-21 | F-40 to F-41, <b>F-41</b>, F-43 | | MIPS R2000/3000, A-56 | read misses and, 291, C-34 to | MTBF (mean time between failures), | | MIPS R3000, I-12 | C-35 | 26 | | MIPS R3010 chip, I-58 to I-60, <b>I-58</b>, | in virtual memory, C-40, C-42 | MTTF (mean time to failure), 26–27, | | I-59 | miss rates | 51, 362, 396–397 | | MIPS R4000 pipeline, A-56 to A-65 | associativity and, 291, C-28 to | MTTR (mean time to repair), 26–27, | | development of, K-19 | C-29,
<b>C-29</b> | 362, 364–365 | | eight-stage structure of, A-56 to | average memory access time and, | multibanked caches, 298-299, 299, | | A-58, A-58, A-59, A-60 to | C-15 to C-16 | 309 | | A-62, <b>A-61</b> | block size and, 291, C-25 to C-28, | multicasting, E-24 | | floating-point pipeline, A-60 to | C-26, C-27 | multicomputers, defined, K-39 | | A-62, <b>A-61</b> , <b>A-62</b> , <b>A-63</b> | cache size and, 291, C-28 | multicore processors | | forwarding and branch delays in, | compiler optimizations and, | Element Interconnect Bus, E-70 | | A-59 to A-60, <b>A-59, A-60</b> | 302–305, <b>304, 309</b> | to E-72, <b>E-71</b> | | performance of, A-63 to A-65, | compiler-controlled prefetching, | MINS compared with, E-92 | | A-64, A-65 | 305–309, <b>309</b> | origin of name, 198 | | MIPS R8000, A-43 | CPU time and, C-18 | performance on SPEC | | MIPS R10000, A-45 | defined, C-4 | benchmarks, 255–257, <b>255</b> , | | | hardware prefetching and, 305. | 256, 257 | | MIPS R12000, 128 | 306, 309 | | | | 300, 307 | | | Multiflow processor, K-22 to K-23 | speeding up, I-47 to I-50, <b>I-48</b> , | optimizing software for, 261–262 | |------------------------------------------------------|---------------------------------------------|-----------------------------------------------------------------| | multigrid methods, H-9 to H-10 | I-49 | reasons for rise of, 262–264 | | multilevel caches, 291, C-21, C-29 to | system issues in, I-11 | references on, K-44 to K-45 | | C-34, C-32, C-39. See also | of two's complement numbers, I-8 | SMP performance, 218-227, 222 | | caches; L1 cache; L2 cache | multiply trees, I-52 to I-53, I-53 | to 226 | | multilevel exclusion, C-34 | multiply-accumulate (MAC), D-5, | snooping protocols, 208–209, | | multilevel inclusion, 248-249, C-34, | D-8, J-18, <b>J-20</b> | <b>209</b> , 216–218 | | K-54 | multiply-step instruction, I-11 to I-12 | SPEC benchmark performance, | | multilevel page tables, C-53, C-54 | multiprocessing. 
See also distributed | 255–257, <b>255, 256, 257</b> | | multimedia support, J-16 to J-19, J-18, | shared-memory | synchronization in, 237–242, <b>242</b> | | J-46 | multiprocessors; large-scale | T1 processor performance, | | multipath fading, D-21 | multiprocessors; symmetric | 249–254, <b>250 to 254</b> | | multiple instruction streams, multiple | shared-memory | taxonomy of, 197–201, <b>201</b> | | data streams (MIMD) | multiprocessors | in vector processors, F-43 | | advantages of, 198 | advantages of, 196 | multiprogrammed workloads, | | Amdahl's Law and, 258-259 | bus-based coherent, K-38 to K-40 | 225–230, <b>227</b> , <b>228</b> , <b>229</b> | | centralized shared-memory | cache coherence protocols, | multiprogramming, C-47 to C-48 | | architectures, 199–200, <b>200</b> | 205–208, <b>206</b> , 211–215, <b>213</b> , | multistage interconnection networks, | | clusters in, 198 | 214 | E-30, E-92 | | distributed-memory architectures, | centralized shared-memory | Multi-Streaming Processors (MSP), | | 200–201, <b>201</b> | architectures, 199–200, <b>200</b> | F-40 to F-41, <b>F-41</b> , F-43 | | historical perspectives on, K-36 | challenges of, 202–204 | multithreading, 172–179. See also | | multicore, 198, 199 | classes of, K-43 | thread-level parallelism | | multiple instruction streams, single | data stream numbers in, K-38 | coarse-grained, 173–174, <b>174</b> , | | data stream (MISD), 197 | defined, K-39 | K-26 | | multiple-issue processors, | directory-based coherence in, | development of, K-26 to K-27 | | development of, K-20 to | 234–237, <b>235, 236</b> | in the directory controller, H-40 | | K-23. See also superscalar | distributed shared memory in, | fine-grained, 173–175, <b>174</b> | | processors; VLIW processors | 230–234, <b>232, 233</b> | overview, 172–173, 199 | | multiple-precision addition, I-13 | in embedded systems. 
D-3, D-14 | parallel processing and, 253, 254 | | multiplexers | to D-15 | processor comparison for, | | in floating-point pipelining, A-54 | historical perspectives on, K-34 to | 179–181, <b>179, 180, 181</b> | | in MIPS pipelining, A-31, A-33, | K-45 | processor limitations in, 181–183 | | A-35, <b>A-37</b> | invalidate protocol | simultaneous, 173–179, <b>174, 178</b> | | in set-associative caches, C-18 | implementation, 209–211 | in Sun T1 processor, 250–252, | | multiplication | large-scale, K-40 to K-44 (See | 251, 252 | | faster multiplication with many | also large-scale | MVL (maximum vector length), F-17, | | adders, I-50 to I-54, I-50 to | multiprocessors) | F-17 | | I-54 | limitations in, 216–217 | MXP processor, D-14 to D-15 | | faster multiplication with single | memory consistency models in, | Myrinet switches, K-42 | | adders, I-47 to I-50, <b>I-48</b>, | 243–248 | Myrinet-2000, <b>E-76</b> | | I-49 | message-passing, 202 | | | floating-point, I-17 to I-21, I-18, | models for communication and | N | | I-19, I-20 | memory architecture, | | | higher-radix, I-48 to I-49, <b>I-49</b> | 201–202 | NAK (negative acknowledgment), | | operands of zero, I-21 | multilevel inclusion, 248–249 | H-37, H-39 to H-41 | | precision of, I-21 | multiprogramming and OS | name, defined, 70 | | radix-2 integer, I-4 to I-7, <b>I-4</b>, <b>I-6</b> | workload performance, | name dependences, 70–71.
See also | | shifting over zeros technique, I-45 | 227–230, <b>228, 229</b> | antidependences; output | | with single adders, I-47 to I-50, | nonuniform memory access in, | dependences | | I-48 | 202 | NaN (Not a Number), I-14, I-16<br>NAS parallel benchmarks, F-51 | | NaT (Not a Thing), G-38, G-40 | lossy, E-11, E-65 | shared-memory (DSM) | |------------------------------------------------------|--------------------------------------------|----------------------------------------| | natural parallelism, 172, D-15 | mesh, E-36, E-40, E-46, E-46 | multiprocessors | | n-body algorithms, H-8 to H-9 | multistage interconnection, E-30, | nonunit strides, F-21 to F-22, F-46, | | n-cube, E-36 | E-92 | F-48 | | NEC SX/2, <b>F-7, F-34,</b> F-49 | on-chip, E-3, E-4, E-70, E-73, | normal distribution, 36 | | NEC SX/6, <b>F-7</b>, F-51 | E-103 to E-104 | no-write allocate, C-11 to C-12 | | NEC SX/5, F-7, F-50 | performance and cost of, E-40 | nullification, A-24 to A-25, J-33 to | | NEC SX/7, 338, <b>339</b> | shared link, E-5 | J-34 | | NEC SX/8, <b>F-7</b>, F-51 | shared-media, E-21 to E-24, E-22, | n-way set associative cache placement, | | NEC VR 4122, D-13, <b>D-13</b> | E-78 | 289, <b>C-7</b>, C-8 | | NEC VR 5432, D-13, <b>D-13</b> | storage areas, E-3, E-102 to E-103 | | | negative acknowledgment (NAK), | switched point-to-point, E-5 | O | | H-37, H-39 to H-41 | system areas, E-3, E-72 to E-77, | occupancy, message, H-3 to H-4 | | negative numbers, I-12, I-12, I-14 | <b>E-75 to E-77,</b> E-100 to E-102 | Ocean application | | nested page tables, 340 | wide area, E-4, E-4, E-75, E-79, | characteristics of, H-9 to H-12, | | NetApp FAS6000 filer, 397–398 | E-97 | H-11 | | Netburst design, 131, 137 | wireless,
D-21 to D-22, <b>D-21</b> | on distributed-memory | | Network Appliance, 365, 391, | NEWS communication, E-41 to E-42 | multiprocessors, H-28, H-28 | | 397–398 | Newton's iteration, I-27 to I-29, I-28 | to H-32, H-30 | | network attached storage (NAS) | NFS (Network File System), 32 | on symmetric shared-memory | | devices, 391 | NFS (network file service), 376, 376 | multiprocessors, H-21 to | | network bandwidth | Ngai, T.-F., I-65 | H-26, H-23 to H-26 | | congestion management and, | Niagara, K-26 | OCN (on-chip networks), E-3, E-4, | | E-65 | NIC (network interface cards), 322, | E-70, <b>E-73,</b> E-103 to E-104 | | performance and, E-18, E-26 to | <b>322,</b> E-87, <b>E-88</b> | octrees, H-9 | | E-27, E-52 to E-55, E-89, | Nicely, Thomas, I-64 | off-load engines, E-8, E-77, E-92 | | E-90 | Nintendo-64, F-47 | offset, in RISC architectures, A-4 to | | switching and, E-50 to E-52 | nodes | A-5 | | topologies and, E-40 to E-41 | home, 232, <b>233</b> | OLTP. See online transaction | | network file service (NFS), 376, 376 | in IBM Blue Gene/L, H-42 to | processing | | Network File System (NFS), 391 | H-44, <b>H-43, H-44</b> | Omega, E-30, <b>E-31</b> | | network interface, E-6, E-62, E-67, | interconnected, E-27 | on-chip multiprocessing, 198, 205. See | | E-76, E-90 | local, 232, <b>233</b> | also multicore processors | | network interface cards (NIC), 322, | remote, 233, <b>233</b> | on-chip networks (OCN), E-3, E-4, | | <b>322,</b> E-87, <b>E-88</b> | X1, F-42, <b>F-42</b> | E-70, <b>E-73</b>, E-103 to E-104 | | Network of Workstations, K-42 | nonaffine array indexes, G-6 | one's complement system, I-7 | | network reconfiguration, E-66 | nonaligned data transfers, J-24 to J-26, | one-third-distance rule of thumb, | | network-on-chip, E-3 | J-26 | 401–403, <b>403</b> | | networks.
See also interconnection | nonatomic operations, 214 | one-way conflict misses, C-25 | | networks | nonbinding prefetches, 306 | online transaction processing (OLTP) | | centralized switched, E-30 to | nonblocking caches, 296–298, 297, | benchmarks for, 374–375, <b>375</b> | | E-34, <b>E-31</b> , <b>E-33</b> , E-48 | <b>309</b> , K-54 | performance evaluation for, 46, | | dedicated link, E-5, E-6, E-6 | non-blocking networks, E-32, E-35, | 47, 48 | | direct, E-34, E-37, E-48, E-67, | E-41, E-56 | in shared-memory | | E-92 | nonfaulting prefetches, 306 | multiprocessors, 220-224, | | distributed switched, E-34 to | non-minimal paths, E-45 | 221, 222 | | E-39, <b>E-36</b> , <b>E-37</b> , <b>E-40</b> , E-46 | nonrestoring division algorithm, I-5 to | OOO (out-of-order) processors, miss | | dynamic reconfiguration, E-67 | I-7. <b>I-6,</b> I-45 to I-47, <b>I-46</b> | penalty and, C-19 to C-21, | | indirect, E-31, E-48, E-67 | nonuniform memory access (NUMA) | C-21 | | local area, E-4, <b>E-4</b> , E-77 to E-79. | multiprocessors, 202. See | opcode field | | <b>E-78</b> , E-99 to E-100 | also distributed | encoding instruction sets, B-21 to | | lossless, E-11, E-59, E-65, E-68 | | B-24, <b>B-22</b> | | | | | #### I-26 Index | opcode field ( <i>continued</i> )<br>in MIPS instruction, B-35, <b>B-35</b><br>operand type and, B-13 | out-of-order execution, 90–91, A-66 to<br>A-67, A-75 to A-76, C-3. 
See<br>also scoreboarding | memory protection and, 316–317<br>multilevel, C-53 to C-54, C-54<br>nested, 340 | |-------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|---------------------------------------------------------------------------| | Open Systems Interconnect (OSI), | out-of-order (OOO) processors, miss | page sizes and, C-45, C-53 | | E-81, <b>E-82</b> | penalties and, C-19 to C-21, | paging of, C-44 | | OpenMP consortium, H-5 | C-21 | process protection and, C-48 | | operands | output dependences, 71, K-23 | shadow, 320 | | address specifiers for, B-21 | output-buffered switches, E-57, E-59 | in virtual memory mapping, C-43 | | decimal, B-14 | overflow, I-8, I-10 to I-12, <b>I-11</b>, I-20 | page table entries (PTEs) | | in instruction encoding, <b>B-22</b> | overhead | in AMD Opteron, 326–327, C-54 | | instruction set architecture | occupancy and, H-4 | in Intel Pentium, C-50 | | classification and, B-3, <b>B-4</b>, | packet switching and, E-51 | in virtual memory, C-43, <b>C-43</b> | | B-5, <b>B-6</b> | receiving, E-14, E-17, E-63, E-76, | paged segments, C-41, C-42 | | in Intel 80x86, J-59 to J-62, <b>J-59</b> | E-88, E-92 | page-level protection, C-36 | | to J-62 | routing algorithms and, E-48 | pages | | in MIPS, 10 | sending, E-14, E-16, E-63, E-76, | in 64-bit Opteron memory | | shifting, J-36 | E-92 | management, C-53 to C-55, | | type and size of, B-13 to B-14, | overlapping triplets, I-49, <b>I-49</b> | C-54 | | B-15 | overlays, C-39 | in virtual memory, C-30, C-41 to | | in VAX, J-67 to J-68, <b>J-68</b> | owner of a cache block, 211, 231, <b>235</b> | C-42, C-41, C-42 | | operating systems | | paired single operations, B-39, D-10 to | | asynchronous I/O and, 391 | P | D-11 | | disk accesses in, 400–401, <b>401</b> | - | PAL code, J-28 | | memory hierarchy
performance | Pacifica, 320, 339 | Panda, D. K., <b>E-77</b> | | and, C-56, C-57 | packets | Paragon, K-40 | | multiprogrammed workload | in asynchronous transfer mode, | parallel, defined, 68 | | performance, 225–230, <b>227</b>, | E-79 | parallel processing | | 228, 229 | discarding, E-65 | historical perspectives on, K-34 to | | page size changes and, C-56 to | in Element Interconnect Bus,<br>E-71 to E-72 | K-36 | | C-57 | | in large-scale multiprocessors, | | Ultrix, C-37 | headers, E-6, E-48, E-52, E-57 to<br>E-58, E-60, E-72 | H-2 | | user access to, C-52 | in IBM Blue Gene/L 3D Torus, | performance with scientific | | in virtual machines, 318, 319, 320 | E-72 | applications, H-33 to H-34 | | operation faults, 367, <b>370</b> | in InfiniBand, E-76 | parallelism. See also instruction-level | | operations, in instruction sets, B-14 to | latency, E-12, <b>E-13</b> | parallelism; thread-level | | B-16, <b>B-15, B-16</b> | size of, E-18, E-19 | parallelism | | operator dependability, 369–371 | switching, E-50, E-77 | Amdahl's Law and, 258–259 | | Opteron processor. See AMD Opteron | trailers, E-6, <b>E-7</b>, E-61 | challenges of, 202–204 | | processor | transport, E-8 to E-9, E-94 | data dependences and, 68–70 | | order of instruction exceptions, A-38 | packing operations, B-14 | data-level, 68, 197, 199 | | to A-41, <b>A-40</b>, <b>A-42</b> | Padua, D., F-51 | at the detailed digital design level, | | organization, defined, 12 | page allocation, 262 | 38 | | orthogonal architectures, B-30, J-83, | page coloring, C-37 | explicit, G-34 to G-37, G-35, | | K-11 | page faults, C-3, C-40 | G-36, G-37 | | OS. See operating systems | page offsets, 291, C-38, C-42 | hardware vs. software approach | | OSI (Open Systems Interconnect), | page remapping, in virtual machines,
| to, 66 | | E-81, <b>E-82</b> | 324 | historical perspectives on, K-24 to | | Otellini, Paul, 195 | page sizes, C-45 to C-46, C-56 to C-57 | K-25, K-34 to K-36 | | out-of-order completion, 90–91, A-54, | page tables | implicit, G-34 | | A-66 | inverted, C-43 | at the individual processor level, | | | | 37 | | multithreading and, 253, <b>254</b> | Pentium MMX, <b>D-11</b> | of vector processors, F-34 to F-38, | |------------------------------------------|-------------------------------------------|----------------------------------------| | natural, 172, D-15 | Perfect Club benchmarks, F-51 | <b>F-35,</b> F-40, F-44 to F-45, | | in scoreboarding, A-74 | perfect-shuffle permutation, E-30 | F-45 | | at the system level, 37 | performance. See also benchmarks; | virtual channels and, E-93 | | taxonomy of, 197–201, 200, 201 | cache performance; processor | in VMIPS, F-36 to F-38 | | in vector processing, F-29 to F-31, | performance | periodic functions, I-32 | | F-29, F-30 | Amdahl's Law and, 184 | permanent failures, E-66 | | paravirtualization, 321–324 | average memory access time and, | permanent faults, 367 | | PA-RISC | C-17 to C-19 | PetaBox GB2000, 393, 394 | | common MIPS extensions in, J-19 | bandwidth and, E-16 to E-19, | phase-ordering problem, B-26 | | to J-24, <b>J-21 to J-23</b> | <b>E-19,</b> E-25 to E-29, <b>E-28,</b> | phases (passes), optimizing, B-25, | | conditional branch options in, | E-89, <b>E-90</b> | B-25 | | B-19 | of branch schemes, A-25 to A-26, | phits, E-60, E-62, E-71 | | extended precision in, I-33 | A-26 | physical caches, defined, C-36 | | features added to, <b>J-44</b> | cache misses and, C-17 to C-19 | physical channels, E-47 | | instructions unique to, J-33 to | cache size and, H-22, H-24, <b>H-24</b>, | physical memory, in virtual machines, | | J-36, <b>J-34</b> | H-27, <b>H-28</b> | 320 | | MIPS core subset in, J-6 to J-16, | of commercial workload, | physical volumes, 390–391 | | J-7, J-9 to J-13, J-17 | 220–230,
<b>221 to 229</b> | pi (π) computation, I-32 | | PA-RISC 1.1, <b>J-4</b> | of compilers, B-27, <b>B-29</b> | PID (process-identifier tags), C-36, | | | contention and, E-25, E-53 | C-37 | | PA-RISC 2.0, J-5 to J-6, <b>J-5</b> | of desktop computers, 44–46, <b>45</b>, | piggyback acknowledgment field, | | PA-RISC MAX2, J-16 to J-19, <b>J-18</b> | 46 | E-84 | | partial store order, 246 | development of measures of, K-6 | Pinkston, T. M., E-104 | | partitioned add operations, D-10 | to K-7 | pin-out constraint, E-39, E-71, E-89 | | Pascal, integer division and remainder | of DRAM, 312–315, <b>313, 314</b> | pipe stages, A-3, A-7 | | in, I-12 | effective bandwidth and, E-16 to | pipeline bubbles, A-13, A-20. See also | | passes, optimizing, B-25, <b>B-25</b> | E-19, <b>E-19</b>, E-25 to E-29, | pipeline stalls | | path loss, <b>D-21</b> | | pipeline depths, F-12 to F-13 | | Patterson, D. A., K-12 to K-13 | E-28, E-89, E-90 | pipeline hazards, A-11 to A-26. See | | payload, E-6, E-61 | Ethernet, E-89, <b>E-90</b> | | | PCI-Express (PCIe), E-29, E-63 | of floating-point operations, 3 | also dependences | | PC-relative addressing, B-10, B-18 | flow control and, E-17 | control hazards, A-11, A-21 to | | PC-relative control flow instructions, | I/O, 371–379, <b>372 to 376, 378</b> | A-26, <b>A-21</b> to <b>A-26</b> | | B-17 | of multicore processors, 255–257, | data hazards, A-11, A-15 to A-21, | | PDP-11 | 255, 256, 257 | A-16 to A-21 | | address size in, C-56 | of multiprocessors, 218–230, | detection of, A-33 to A-35, A-34 | | memory caches in, K-53 | 249–257 | in floating-point pipelining, A-49 | | memory hierarchy in, K-52 | of online transaction processing, | to A-54, A-51, A-57, A-58, | | Unibus, K-63 | 46, 47, 48 | A-61 to A-65, <b>A-61 to A-63</b> | | peak performance, 51, 52 | peak, 51, <b>52</b> | load interlocks, A-33 to A-35, | | peer-to-peer architectures, D-22 | pipeline stalls and, A-11 to A-13 | A-34 | | peer-to-peer communication, E-81 to |
real-time, 7, D-3 | in longer latency pipelines, A-49 | | E-82 | of scientific applications, H-21 to | to A-54, A-50, A-51 | | Pegasus, K-9 | H-26, <b>H-23 to H-26</b> | multicycle operations and, A-46 to | | Pentium. See Intel Pentium; Intel | of servers, 46–48, <b>47, 48</b> | A-47 | | Pentium 4; Intel Pentium 4 | simultaneous multithreading and, | performance of pipelines with | | Extreme | 177–179, <b>178</b> | stalls, A-11 to A-13 | | Pentium chip, division bug in, I-2, I-64 | of superscalar processors, 16, | structural hazards, A-11, A-13 to | | to I-65 | 179–181, <b>179, 180, 181</b> | A-15, A-64, <b>A-65</b> | | Pentium D, 198 | topology and, E-40 to E-44, E-44, | pipeline latches, A-30, A-36 | | Pentium III, 183 | E-52 | pipeline registers, A-8 to A-10, A-9, | | Pentium M, 20 | transistors and, 17-19 | A-30, A-35 | | | | | | | | | | pipeline reservation tables, K-19 | in interconnection networks, | in multiple-issue processors, 182 | |---------------------------------------------------------------------|-------------------------------------------|---------------------------------------------| | pipeline scheduling, loop unrolling | E-12, E-25, E-51 to E-52, | redundancy of supplies, 27-28 | | and, 75–80, <b>75</b> , 117–118 | E-60, E-65, E-70 | reliability of, 49 | | pipeline stalls | interlocks, A-20, A-33 to A-35, | static, 19 | | bubbles, A-13, A-20 | <b>A-34</b> , A-52, F-9 to F-10 | transistor and wire scaling and, | | data hazards requiring stalls, A-19 | in Itanium 2 processor, G-42 | 17–19 | | to A-20, <b>A-20, A-21</b> | link, E-16, E-92 | Power processors, 128 | | diagrams of, A-13, A-15 | microinstruction execution, A-46 | Power2 processor, 130, A-43 | | in floating-point pipelines, A-51, | to A-47 | Power4 processor, 52 | | A-51 | MIPS branches in, A-35 to A-37, | Power5 processor. 
See IBM Power5 | | minimizing by forwarding, A-17 | A-38, A-39 | processor | | to A-18, A-18, A-35, A-36, | multicycle operations and, A-46 to | PowerEdge 1600SC, 323 | | A-37 | A-47 | PowerEdge 2800, 47, 48, 49 | | in MIPS pipelines, A-33 to A-35, | in multiplication, I-51 | PowerEdge 2850, <b>47, 48, 49</b> | | A-34 | overview of, 37, A-2 to A-3 | PowerPC | | in MIPS R4000 pipeline, A-63 to | in Pentium 4, 131–132, <b>132, 133</b> | addressing modes in, J-5 to J-6, | | A-65, <b>A-63, A-64</b> | performing issues in, A-10 to | J-5 | | performance and, A-11 to A-13 | A-11 | AltiVec in, F-47 | | in SMPs, <b>222</b> | SMP stalls in, 222 | common extensions in, J-19 to | | in vector processors, F-9 to F-10 | software, D-10, G-12 to G-15, | J-24, <b>J-21 to J-23</b> | | pipelined cache access, 296, 309 | G-13, G-15 | conditional branch options in, | | pipelined circuit switching, E-50, E-71 | stopping and restarting execution, | B-19 | | pipelines, self-draining, K-21 | A-41 to A-43 | features added to, J-44 | | pipelining, A-2 to A-77 | superpipelining, A-57 | instructions unique to, J-32 to | | in addition, I-25 | in switch microarchitecture, E-60 | J-33 | | basic MIPS, A-30 to A-33, A-31, | to E-61, <b>E-60</b> | MIPS core subset in, J-6 to J-16, | | A-32 | in vector processors, F-31 to F-32, | J-7, J-9 to J-13, J-17 | | condition codes in, A-5, A-46 | F-31 | multimedia support in, J-16 to | | data dependences and, 69–70 | Pleszkun, A. 
R., A-55, K-22 | J-19, <b>J-18</b> | | depth of, A-12 | pointers | performance per watt in, D-13, | | dynamic scheduling in, A-66 to | current frame, G-33 to G-34 | D-13 | | A-75, A-68, A-71, A-73 to | dependences and, G-9 | reduced code size in, B-23 | | A-75 | function, B-18 | PowerPC 620, A-55 | | in embedded systems, D-7 to | urgent pointer field, E-84 | PowerPC AltiVec, B-31, <b>D-11</b> | | D-10, <b>D-7</b> | in VAX, J-71 | precise exceptions, A-43, A-54 to A-56 | | encoding instruction sets and, | points-to analysis, G-9 to G-10 | Precision Workstation 380, <b>45</b> | | B-21 to B-22 | point-to-point links, 390, E-24, E-29, | predicated instructions, G-23 to G-27 | | exceptions in, A-38 to A-41, <b>A-40</b> , | E-79 | annulling instructions, G-26 | | A-42 | poison bits, G-28, G-30 to G-32 | in ARM, J-36 | | five-stage pipeline for RISC | Poisson distribution, 384–390, <b>388</b> | concept behind, G-23 | | processors, A-6 to A-10, <b>A-7</b> , | Poisson processes, 384 | conditional moves in, G-23 to | | A-8, A-9, A-21 | polycyclic scheduling, K-23 | G-24 | | floating-point, A-47 to A-56, A-48 | Popek, Gerald, 315 | exceptions in, G-25 to G-26 | | to <b>A-51</b> , <b>A-57</b> , <b>A-58</b> , A-60 to | POPF, 338 | in Intel IA-64, G-38, <b>G-39</b> | | A-62, A-61 to A-63 | position independence, B-17 | limitations of, G-26 to G-27 | | freezing/flushing, A-22<br>historical perspectives on, K-10, | postbytes, J-57, <b>J-57</b> | moving time-critical, G-24 to | | K-18 to K-27 | POWER, J-44 | G-25 | | | power | predication, D-10 | | increasing instruction fetch<br>bandwidth in, 121–127, <b>122</b> , | in cell phones, D-22, D-24 | predicted-not-taken scheme, A-22, | | 124, 126 | dynamic, 18–19 | A-22, A-25, A-26, A-26 | | 124, 120 | EEMBC benchmarks for | predicted-taken scheme, A-23, <b>A-25</b> , | | | consumption of, D-13, <b>D-13</b> | A-26, <b>A-26</b> | | prefetching | using parallelism to improve, | protocol families, E-81 | 
|--------------------------------------------------|---------------------------------------------------------|-------------------------------------------------------------------------| | in AMD Opteron, 330 | 37–38 | protocol fields, E-84 | | compiler-controlled, 305–309, 309 | processor-dependent optimizations,<br>B-26, <b>B-28</b> | protocol stacks, E-83, <b>E-83</b><br>protocols, E-8, E-62, E-77, E-91, | | development of, K-54 | processors. See also digital signal | E-93. See also names of | | hardware, 305, <b>306, 309</b> | processors; multiprocessing; | specific protocols | | instruction set architecture and, | superscalar processors; vector | PS2. See Sony Playstation 2 | | B-46 | processors; VLIW | PTE. See page table entries | | integrated instruction fetch units | processors; names of specific | | | and, 126 | processors | Q | | in RISC desktop architectures, | array, K-36 | QR factorization method, H-8 | | J-21 | directory-based multiprocessors, | QsNet, E-76 | | prefixes, in instructions, J-51, J-55 | H-29, <b>H-31</b> | quad precision, I-33 | | present bits, C-50 | in embedded computers, 7-8 | queue, 380 | | price-performance, in desktop | importance of cost of, 49-50 | queue depth, 360, <b>360</b> | | computers, 5, 5 | massively parallel, H-45 | queue discipline, 382 | | prices vs. costs, 25-28 | microprocessors, 5, 15, 16, 17, 20, | queuing locks, H-18 to H-20 | | primitives, 239–240, H-18 to H-21. 
| 20, 341 | queuing theory, 379–390 | | H-21 | multicore, 255–257, <b>255, 256,</b> | examples of, 387 | | principle of locality, 38, 288, C-2 | <b>257</b>, E-70 to E-72, E-92 | overview, 379–382, <b>379, 381</b> | | private data, 205 | out-of-order, C-19 to C-21, C-21 | Poisson distribution of random | | probability mass function, 384 | performance growth since | variables in, 382–390, 388 | | procedure invocation options, B-19 to | mid-1980s, 2–4, 3 | | | B-20 | Single-Streaming, F-40 to F-41, | R | | process switch, 316, C-48 | <b>F-41,</b> F-43 | races, 218, 245 | | processes | VPU, D-17 to D-18 | radio waves, D-21 to D-22, <b>D-23</b> | | defined, 199, 316, C-47 to C-48 | producer-server model, 371, 372 | RAID (redundant arrays of | | protection of, C-48 to C-49 | profile-based predictors, 161, 162 | inexpensive disks). See also | | process-identifier tags (PID), C-36, | program counter (PC), B-17. See also | disk arrays | | C-37 | PC-relative addressing | availability benchmark, 377, 378 | | processor consistency, 245 | program order. See also out-of-order | development of, K-61 to K-62 | | processor cycle, A-3 | completion; out-of-order | levels of, 362–366, <b>363, 365</b> | | processor performance, 28–44. See | execution | logical units in, 391 | | also benchmarks | control dependences and, 72–74 | reliability, 400 | | Amdahl's Law and, 39–42, 184 | data hazards and, 71 | RAID-DP (row-diagonal parity), | | average memory access time and, | memory consistency and, | 365–366, <b>365</b> | | C-17 to C-19 | 243–246 | RAMAC-350, K-59 to K-60, K-62 to | | benchmarks in, 29–33, <b>31, 35</b> | in shared-memory | K-63 | | of desktop systems, 45–46, <b>45</b>, <b>46</b> | multiprocessors, 206 | RAMBUS, 336 | | equation for, 41–44 | propagation delay, E-10, E-13, E-14, | random block replacement, C-9, C-10 | | execution time in, 28–29 | E-25, E-40 | random variables, distributions of, | | focusing on the common case, 38 | protection.
See also memory | 382–390, <b>388</b> | | parallelism and, 37–38 | protection | RAS (row access strobe), 311–312, | | peak performance, 51, 52 | in 64-bit Opteron memory | 313 | | price and, 45–46, <b>45, 46</b> | management, C-53 to C-55, | Rau, B. R., K-23 | | principle of locality in, 38 | C-54, C-55 | RAW (read after write) hazards | | real-time, D-3 to D-5 | call gates, C-52 | in floating-point MIPS pipelines, | | summarizing benchmark results, | capabilities in, C-48 to C-49 | <b>A-50,</b> A-51 to A-53, <b>A-51</b> | | 33–37, <b>35</b> | in Intel Pentium, C-48 to C-52, | hardware-based speculation and, | | using benchmarks to measure, | C-51 | 112, 113 | | 29–33, <b>31</b> | rings in, C-48 | as ILP limitations, 71 | | | | | #### I-30 Index | RAW (read after write) hazards | register renaming | regularity, E-33, E-38 | |---------------------------------------------|--------------------------------------------|--------------------------------------------------------| | (continued) | finite registers and, 162–164, 163 | relative speedup, 258 | | load interlocks and, A-33 | in ideal processor, 155, 157 | relaxed consistency models, 245–246 | | in scoreboarding, A-69 to A-70, | name dependences and, 71 | release consistency, 246, K-44 | | A-72 | reorder buffers vs., 127–128 | reliability | | Tomasulo's approach and, 92 | in Tomasulo's approach, 92–93, | Amdahl's Law and, 49 | | read miss | 96–97 | benchmarks of, 377–379, <b>378</b> | | directory protocols and, 231, 233, | register rotation, G-34 | defined, 366–367 | | 234–237, <b>236</b> | register stack engine, G-34 | "five nines" claims of availability, | | miss penalty reduction and, 291, | register windows, J-29 to J-30 | 399–400, <b>400</b> | | C-34 to C-35, <b>C-39</b> | register-memory ISAs, 9, B-3, <b>B-4</b>, | implementation location and, 400 | | in Opteron data cache, C-13 to | B-5, <b>B-6</b> | in interconnection networks, E-66 | | C-14 | register-register architecture, B-3 to | module, 26 | | in
snooping protocols, 212, <b>213</b> , | B-6, <b>B-4, B-6</b> | operator, 369–371 | | 214 | registers | relocation, C-39 | | real addressing mode, J-45, J-50 | base, A-4 | remote memory access time, H-29 | | real memory, in virtual machines, 320 | branch, J-32 to J-33 | remote nodes, 233, <b>233</b> | | real-time constraints, D-2 | count, J-32 to J-33 | renaming. See register renaming | | real-time performance, 7, D-3 | current frame pointer, G-33 to | renaming maps, 127-128 | | rearrangeably non-blocking networks, | G-34 | reorder buffers (ROB) | | E-32 | finite, effect on ILP, 162–164, <b>163</b> | development of, K-22 | | receiving overhead, E-14, E-17, E-63, | floating-point, A-53, B-34, B-36 | in hardware-based speculation, | | E-76, E-92 | general-purpose, B-34 | 106–114, <b>107, 110, 111, 113,</b> | | reception bandwidth, E-18, E-26, | history and future files, A-55 | G-31 to G-32 | | E-41, E-55, E-63, E-89 | in IBM Power5, 162 | renaming vs., 127-128 | | RECN (regional explicit congestion | instruction encoding and, B-21 | in simultaneous multithreading, | | notification), E-66 | in instruction set architectures, 9, | 175 | | reconfiguration, E-45 | 9 | repeat (initiation) intervals, A-48 to | | recovery time, F-31, F-31 | integer, B-34 | A-49, <b>A-49, A-62</b> | | recurrences, G-5, G-11 to G-12 | in Intel 80x86, J-47 to J-49, J-48, | repeaters, E-13 | | red-black Gauss-Seidel multigrid | J-49 | replication of shared data, 207-208 | | technique, H-9 to H-10 | in Intel IA-64, G-33 to G-34 | requested protection level, C-52 | | Reduced Instruction Set Computer | link, 240, J-32 to J-33 | request-reply, E-45 | | architectures. See RISC | loop unrolling and, 80 | reservation stations, 93, <b>94,</b> 95–97, <b>99,</b> | | (Reduced Instruction Set | in MIPS architecture, B-34 | <b>101,</b> 104 | | Computer) architectures | in MIPS pipeline, A-30 to A-31, | resource sparing, E-66, E-72 | | redundant arrays of inexpensive disks. | A-31 | response time. 
See also execution | | See RAID | number required, B-5 | time; latency | | redundant quotient representation, | pipeline, A-8 to A-10, <b>A-9</b> | defined, 15, 28, 372 | | I-47, I-54 to I-55, <b>I-55</b> | predicate, G-38, G-39 | throughput vs., 372–374, 373, 374 | | regional explicit congestion | in RISC architectures, A-4, A-6, | restarting execution, A-41 to A-43 | | notification (RECN), E-66 | A-7 to A-8, <b>A-8</b> | restoring division algorithm, I-5 to I-7, | | register addressing mode, B-9 | in scoreboarding, A-71, A-72 | I-6 | | register fetch cycle, A-5 to A-6, A-26 | in software pipelining, G-14 | restricted alignment, B-7 to B-8, B-8 | | to A-27, <b>A-29</b> | in Tomasulo's approach, 93, 99 | resuming events, A-41, A-42 | | register indirect addressing mode | in VAX procedure, J-72 to J-76, | return address predictors, 125, <b>126</b>, | | jumps, B-17 to B-18, <b>B-18</b> | <b>J-75,</b> J-79 | K-20 | | in MIPS data transfers, B-34 | vector-length, F-16 to F-18 | returns, procedure, B-17 to B-19, B-17 | | overview of, <b>B-9</b>, B-11, <b>B-11</b> | vector-mask, F-26 | reverse path, in cell phone base | | register prefetch, 306 | in VMIPS, F-6, <b>F-7</b> | stations, D-24 | | register pressure, 80 | VS, F-6 | rings, C-48, E-35 to E-36, E-36, E-40, | | | | E-70 | | ripple-carry addition, I-2 to I-3, I-3, I-42, I-44 | algorithm for, E-45, E-52, E-57, E-67 | Barnes application, H-8 to H-9, | |---|---|---| | RISC (Reduced Instruction Set | deterministic, E-46, E-53 to E-54, | computation-to-communication | | Computer) architectures, J-1 | <b>E-54,</b> E-93 | ratio in, H-10 to H-12, <b>H-11</b> | | to J-90 | packet header information, E-7, | on distributed-memory | | ALU instructions in, A-4 | E-21 | multiprocessors, H-26 to | | classes of instructions in, A-4 to | in shared-media networks, E-22 to | H-32, H-28 
to H-32 | | A-5 | E-24, E-22 | FFT kernels, H-7, <b>H-11</b>, H-21 to | | digital signal processors in | switch microarchitecture and, | H-29, <b>H-23 to H-26, H-28 to</b> | | embedded, J-19 | <b>E-57,</b> E-60 to E-61, E-61 | H-32 | | five-stage pipeline for, A-6 to | in switched-media networks, E-24 | LU kernels, H-8, <b>H-11</b>, H-21 to | | A-10, <b>A-7</b>, <b>A-8</b>, <b>A-9</b>, <b>A-21</b> | routing algorithm, E-45, E-52, E-57, | H-26, H-23 to H-26, H-28 to | | historical perspectives on, 2, K-12 | E-67 | H-32 | | to K-15, K-14 | row access strobes (RAS), 311–312, | need for more computation in, | | lineage of, <b>J-43</b> | 313 | 262 | | MIPS core extensions in, J-19 to | row major order, 303 | Ocean application, H-9 to H-12, | | J-24, <b>J-21 to J-24</b> | row-diagonal parity (RAID-DP), | H-11 | | MIPS core subsets in, J-6 to J-16, | 365–366, <b>365</b> | parallel processor performance in, | | J-7 to J-16 | Rowen, C., I-58 | H-33 to H-34 | | multimedia extensions in, J-16 to | RP3, K-40 | | | | | on symmetric shared-memory | | J-19, <b>J-18</b> | RS 6000, K-13 | multiprocessors, H-21 to | | overview of, A-4 to A-5, <b>J-42</b> | | H-26, H-23 to H-26 | | pipelining efficiency in, A-65 to | S | scoreboarding, A-66 to A-75 | | A-66 | SAGE, K-63 | basic steps in, A-69 to A-70, A-72 | | reduced code size in, B-23 to | Saltzer, J. H., E-94 | A-73, A-74 | | B-24 | SAN (system area networks), E-3, | costs and benefits of, A-72 to | | simple implementation without | E-72 to E-77, E-75 to E-77, | A-75, <b>A-75</b> | | pipelining, A-5 to A-6 | E-100 to E-102. See also | data structure in, A-70 to A-72, | | unique instructions in, J-24 to | interconnection networks | A-71 | | J-42, <b>J-26, J-31, J-34</b> | Santayana, George, K-1 | development of, 91, K-19 | | virtualization of, 320 | Santoro, M. R., I-26 | goal of, A-67 | | RISC-I/RISC-II, K-12 to K-13 | Sanyo VPC-SX500 digital camera, | in Intel Itanium 2, G-42, G-43 | | ROB. 
See reorder buffers | D-19, <b>D-20</b> | structure of, A-67 to A-68, <b>A-68</b>, | | rotate with mask instructions, J-33 | SAS (Serial Attach SCSI), 361, 361 | A-71 | | rounding | SATA disks, 361, <b>361</b> | scratch pad memory (SPRAM), D-17, | | double, I-34, I-37 | saturating arithmetic, D-11, J-18 to | D-18 | | in floating-point addition, I-22 | J-19 | SCSI (small computer systems | | in floating-point division, I-27, | scalability, 37, 260–261, K-40 to K-41 | interface), 360–361, <b>361,</b> | | I-30 | scaled addressing mode, <b>B-9</b>, <b>B-11</b>, | K-62 to K-63 | | in floating-point multiplication, | J-67 | SDRAM (synchronous DRAM), | | I-17 to I-18, I-18, I-19, I-20 | scaled speedup, 258–259, H-33 to | 313–314, 338, <b>338</b> | | floating-point remainders, I-31 | H-34 | SDRWAVE, I-62 | | fused multiply-add, I-32 to I-33 | scaling, 17–19, 259, H-33 to H-34 | sector-track cylinder model, 360–361 | | in IEEE floating-point standard, | Scarott, G., K-53 | security. See memory protection | | I-13 to I-14, <b>I-20</b> | scatter-gather operations, F-27 to | seek distances and times, 401–403, | | precision and, I-34 | F-28, F-48 | 402, 403 | | underflow and, I-36 | scheduling, historical perspectives on, | segments | | round-off errors, D-6, <b>D-6</b> | K-23 to K-24 | segment descriptors, C-50 to | | round-robin, E-49, E-50, E-71, E-74 | Schneck, P. 
B., F-48 | C-51, C-51 | | routing | scientific/technical computing, H-6 to | in virtual memory, C-40 to C-42, | | adaptive, E-47, E-53 to E-54, | H-12 | C-41, C-42, C-49 | | <b>E-54,</b> E-73, E-93 to E-94 | | self-draining pipelines, K-21 | | | | | | self-routing property, E-48 | shadow page tables, 320 | potential performance advantages | |---------------------------------------------|-----------------------------------------------|------------------------------------------| | semantic clash, B-41 | shadowing (mirroring), 362, <b>363</b> , K-61 | from, 177–179, <b>178</b> | | semantic gap, B-39, B-41, K-11 | to K-62 | preferred-thread approach, | | sending overhead, E-14, E-16, E-63, | shared data, 205 | 175–176 | | E-76, E-92 | shared link networks, E-5 | single extended precision, I-16, I-33 | | sense-reversing barriers, H-14 to | shared memory. See also distributed | single instruction stream, multiple data | | H-15, <b>H-15</b> , H-21, <b>H-21</b> | shared-memory | streams (SIMD). See SIMD | | sentinels, G-31, K-23 | multiprocessors; symmetric | single instruction stream, single data | | sequential consistency, 243–244, K-44 | shared-memory | streams (SISD), 197, K-35 | | sequential interleaving, 299, <b>299</b> | multiprocessors | single-chip multiprocessing, 198. 
See | | serial advanced technology attachment | communication, H-4 to H-6 | also multicore processors | | (SATA), 361, <b>361</b>, E-103 | defined, 202 | single-precision numbers | | Serial Attach SCSI (SAS), 361, 361 | multiprocessor development, | IEEE standard on, <b>I-16</b>, I-33 | | serialization, 206–207, H-16, H-37 | K-40 | multiplication of, I-17 | | serpentine recording, K-59 | synchronization, J-21 | representation of, I-15 to I-16 | | serve-longest-queue, E-49 | shared-media networks, E-21 to E-25, | rounding of, I-34 | | server benchmarks, 32–33 | <b>E-22,</b> E-78 | Single-Streaming Processors (SSP), | | servers | shared-memory communication, H-4 | F-40 to F-41, F-41, F-43 | | characteristics of, D-4 | to H-6 | SISD (single instruction stream, single | | defined, 380, <b>381</b> | shifting over zeros technique, I-45 to | data streams), 197, K-35 | | downtime costs, 6 | I-47, <b>I-46</b> | Sketchpad, K-26 | | instruction set principles in, B-2 | shortest path, E-45, E-53 | SLA (Service Level Agreements), | | memory hierarchy in, 341 | sign magnitude system, I-7 | 25–26 | | operand type and size in, B-13 to | signal processing, digital, D-5 to D-7, | sliding window protocol, E-84 | | B-14 | D-6 | SLO (Service Level Objectives), | | performance and | signals, in embedded systems, D-2 | 25–26 | | price-performance of, 46–48, | signal-to-noise ratio (SNR), D-21, | Slotnick, D. L., K-35 | | 47, 48 | D-21 | small computer systems interface | | price range of, <b>5</b>, 6–7 | signed numbers, I-7 to I-10, I-23, I-24, | (SCSI), 360–361, <b>361,</b> K-62 | | requirements of, 6–7 | I-26 | to K-63 | | transaction-processing, 46–48, 47, | signed-digit trees, I-53 to I-54 | Smalltalk, J-30 | | 48 | sign-extended offsets, A-4 to A-5 | smart switches, E-85 to E-86, E-86 | | utilization, 381, 384–385, 387 | significands, I-15 | Smith, A. J., K-53 to K-54 | | Service Level Agreements (SLA), | Silicon Graphics MIPS 1. 
See MIPS 1 | Smith, Burton, K-26 | | 25–26 | Silicon Graphics MIPS 16. See MIPS | Smith, J. E., A-55, K-22 | | Service Level Objectives (SLO), | 16 | SMP. See symmetric shared-memory | | 25–26 | SIMD (single instruction stream, | multiprocessors | | service specification, 366 | multiple data streams) | SMT. See simultaneous multithreading | | set-associative caches | compiler support for, B-31 to | snooping protocols, 208–218 | | defined, 289 | B-32 | cache coherence implementation | | miss rate and, C-28 to C-29, C-29 | defined, 197 | and, H-34 | | n-way cache placement, 289, C-7, | in desktop processors, D-11, <b>D-11</b> | development of, K-39 to K-40, | | C-8 | in embedded systems, D-10, <b>D-16</b> | K-39 | | parallelism in, 38 | historical perspectives on, K-34 to | examples of, 211–215, <b>213, 214</b>, | | structure of, C-7 to C-8, C-7, C-8 | K-36 | 215, K-39 | | sets, defined, C-7 | Streaming SIMD Extension, B-31 | implementation of, 209–211, | | settle time, 402 | simultaneous multithreading (SMT), | 217–218 | | SFS benchmark, 376 | 173–179 | limitations of, 216–217, <b>216</b> | | SGI Altix 3000, <b>339</b> | approaches to superscalar issue | overview, 208–209, <b>209</b> | | SGI Challenge, K-39 | slots, 174–175, 174 | SNR (signal-to-noise ratio), D-21, | | SGI Origin, H-28 to H-29, <b>H-31,</b> K-41 | design challenges in, 175–177 | D-21 | | shadow fading, D-21 | development of, K-26 to K-27 | | | SoC (system-on-chip), 
D-3, D-19, | SPEC89, 30 | superlinear, 258 switch microarchitecture and, | |---|---|---| | <b>D-20,</b> E-23, E-64 | SPEC92, 157 | | | soft real-time systems, 7, D-3 | SPEC2000, 331–335, <b>332, 333</b>, | E-62 | | software | 334 | spin locks | | optimization, 261–262, 302–305, | SPECfp, E-87; SPEChpc96, F-51 | coherence in implementation of, 240–242, <b>242</b> | | 304, 309 | SPECint, E-87 | with exponential back-off, H-17 | | pipelining, D-10, G-12 to G-15, | SPECMail, 376 | to H-18, <b>H-17</b> | | G-13, G-15 | SPEC-optimized processors, E-85 | spin waiting, 241, 242 | | speculation, control dependences | SPECrate, 32 | SPRAM (scratch pad memory), D-17, | | and, 74 | SPECRatio, 34–37, <b>35</b> | D-18 | | Solaris, 377–379, <b>378</b> | SPECSFS, 32, 376 | spread spectrum, D-25 | | Sony Playstation 2 (PS2) block diagram of, <b>D-16</b> | Web site for, 30 | square root computations, I-14, I-31, | | embedded microprocessors in, | special-purpose register computers, | I-64, J-27 | | | B-3 | squared coefficient of variance, 383 | | D-14 | spectral methods, computing for, H-7 | SRAM (static RAM), 311, F-16 | | Emotion Engine in, D-15 to D-18, | speculation. 
See also hardware-based | SRC-6 system, F-50 | | D-16, D-18 | speculation | SRT division, I-45 to I-47, <b>I-46</b>, I-55 | | Graphics Synthesizer, <b>D-16</b>, D-17 | compiler, G-28 to G-32 | to I-58, <b>I-57</b> | | to D-18 | development of, K-22 | SSE (Streaming SIMD Extension), | | vector instructions in, F-47 | dynamic scheduling in, 104 | B-31 | | source routing, E-48 | | SSE/SSE2, J-46 | | SPARC | memory latency hiding by, 247–248 | stack architecture | | addressing modes in, J-5 to J-6, | misspeculation rates in the | extended, J-45 | | J-5 | Pentium 4, 134–136, <b>135</b> | high-level languages and, B-45 | | architecture overview, <b>J-4</b> | multiple instructions and, | historical perspectives on, B-45, | | common extensions in, J-19 to J-24, <b>J-21 to J-23</b> | 118–121, <b>120, 121</b> | K-9 to K-10 | | conditional branch options in, | optimizing amount of, 129 | in Intel 80x86, J-52 | | B-19 | register renaming vs. reorder | operands in, B-3 to B-5, <b>B-4</b> | | exceptions and, A-56 | buffers, 127–128 | stalls. 
See also dependences; pipeline | | extended precision in, I-33 | software, 74 | stalls | | features added to, <b>J-44</b> | through multiple branches, 129 | bubbles, A-13, A-20, E-47, E-53 | | instructions unique to, J-29 to | value prediction and, 130 | control, 74 | | J-32, <b>J-31</b> | speculative code scheduling, K-23 | data hazard, A-19 to A-20, A-20, | | MIPS core subset in, J-6 to J-16, | speculative execution, 325 | A-21, A-59, A-59 | | J-7, J-9 to J-13, J-17 | speed of light, E-11 | forwarding and, A-17 to A-18, | | multiply-step instruction in, I-12 | speedups | A-18 | | register windows in, J-29 to J-30 | Amdahl's law and, 39–41, | reservation stations and, 93, 94, | | SPARC VIS, <b>D-11</b>, J-16 to J-19, <b>J-18</b> | 202–203 | 95–97, <b>99, 101,</b> 104 | | SPARCLE processor, K-26 | in buffer organizations, E-58 to | write, C-11 | | sparse array accesses, G-6 | E-60 | standard deviation, 36 | | sparse matrices, in vector mode, F-26 | cost-effectiveness and, 259, 260 | Stanford DASH multiprocessor, K-41 | | to F-29 | execution time and, 257–258 | Stanford MIPS computer, K-12 to | | spatial locality, 38, 288, C-2, C-25 | linear, 259–260, <b>260</b> | K-13, K-21 | | SPEC (Standard Performance | as performance measure in | start-up time, in vector processors, | | Evaluation Corporation) | parallel processors, H-33 to | F-11 to F-12, <b>F-13,</b> F-14, | | evolution of, 29–32, <b>31</b>, K-7 | H-34 | <b>F-20,</b> F-36 | | Perfect Club and, F-51 | from pipelining, A-3, A-10 to | starvation, E-49 | | reproducibility of, 33 | A-13 | state transition diagrams, 234–236, | | SPECWeb, 32, 249 | relative vs. 
true, 258 | 235, 236 | | SPEC CPU2000, 31, <b>35</b> | scaled, 258–259, H-33 to H-34 | static branch prediction, 80–81, <b>81</b>, | | SPEC CPU2006, 30, 31 | from SMT, 177–178, <b>178</b> | D-4 | | | | | | static scheduling, A-66 | strong typing, G-10 | in embedded systems, D-8 | |---|---|---| | steady state, <b>379</b>, 380 | structural hazards, A-11, A-13 to A-15, | goals of, 114 | | sticky bits, I-18, I-19 | A-70. See also pipeline | ideal, 155–156, <b>157</b> | | Stop & Go flow control, E-10 | hazards | increasing instruction fetch | | storage area networks, E-3, E-102 to | subset property, 248 | bandwidth in, 121–127, <b>122</b>, | | E-103 | subtraction, I-22 to I-23, I-45 | 124, 126 | | storage systems, 357–404. See also | subword parallelism, J-17 | issue slots in, 174–175, <b>174</b> | | disk storage; I/O | Sun Java Workstation W1100z, 46–47, | limitations of, 181–183 | | asynchronous I/O, 391 | 46 | SMT performance comparison on, | | block servers vs. filers, 390–391 | Sun Microsystems, fault detection in, | 179–181, <b>179, 180, 181</b> | | dependability benchmarks, | 51–52 | speculation in, 118–121, <b>120, 121</b> | | 377–379, <b>378</b> | Sun Microsystems SPARC. 
See | types of, 114 | | disk arrays, 362–366, 363, 365 | SPARC | vectorization of, F-46 to F-47 | | disk storage improvements, | Sun Microsystems UNIX, C-36 to | supervisor process, 316 | | 358–361, <b>359, 360, 361</b> | C-37 | Sur, S., E-77 | | faults and failures in, 366–371, | Sun Niagara processor, 300, 341 | Sutherland, Ivan, K-26 | | 369, 370 | Sun T1 | SV1ex, F-7 | | filers, 391, 397–398 | directories in, 208, 231 | Swartzlander, E., I-63 | | flash memory, 359–360 | multicore, 198, 205 | switch degree, E-38 | | Internet Archive, 392–397, <b>394</b> | multithreading in, 250–252, 251 | switch microarchitecture, E-55, E-60 | | I/O performance, 371–379, <b>372 to</b> | organization of, 249, 250, 251 | switch statements, register indirect | | 376, 378 | overall performance of, 253–257, | jumps for, B-18 | | point-to-point links and switches | 253 to 257 | switched point-to-point networks, E-5 | | in, 390, <b>390</b> | Sun Ultra 5, 35 | switched-media networks, E-21, E-24, | | queuing theory, 379–382, <b>379</b>, | Sun UltraSPARC, E-73 | E-25 | | 381 | Super Advanced IC, <b>D-20</b> | switches | | sector-track cylinder model, | superblocks, G-21 to G-23, G-22 | context, 316 | | 360–361 | supercomputers, 7 | input-buffered, E-57, E-59, E-62, | | Tandem disks, 368–369, <b>370</b> | SuperH | E-73 | | Tertiary Disk project, 368, <b>369</b>, 399, <b>399</b> | addressing modes in, J-5 to J-6, | input-output-buffered, E-57, | | | J-6 | E-57, E-60, E-61, E-62 | | throughput <i>vs.</i> response time, 372–374, <b>373, 374</b> | architecture overview, <b>J-4</b> | microarchitecture, E-55 to E-58, | | transaction-processing | common extensions in, J-19 to J-24, <b>J-23</b>, <b>J-24</b> | E-56, E-57, E-62 | | benchmarks, 374–375, <b>375</b> | conditional branch options in, | output-buffered, E-57, E-59 | | StorageTek 9840, K-59 | B-19 | pipelining, E-60 to E-61, <b>E-61</b>; point-to-point, 390, <b>390</b> | | store buffers, 94–95, <b>94</b>, 97, <b>101</b>, | 
instructions unique to, J-38 to | process, 316, C-48 | | 102–104 | J-39 | smart, E-85 to E-86, <b>E-86</b> | | store conditional instruction, 239–240 | MIPS core subset in, J-6 to J-16, | switching | | store-and-forward switching, E-50, | J-8, J-9, J-14 to J-17 | buffered wormhole, E-51 | | E-79 | multiply-accumulate in, J-19, | circuit, E-50, E-64 | | streaming buffers, K-54 | J-20 | cut-through, E-50, E-60, E-74 | | Streaming SIMD Extension (SSE), | reduced code size in, B-23 to | defined, E-22 | | B-31 | B-24 | network performance and, E-52 | | Strecker, W. D., C-56, J-65, J-81, | superlinear speedup, 258 | packet, E-50, E-77 | | K-11, K-12, K-14, K-52 | "superpages," C-57 | pipelined circuit, E-50, E-71 | | Stretch (IBM 7030), K-18 | superpipelining, A-57 | in shared-media networks, E-23 | | stride, F-21 to F-23 | superscalar (multiple-issue) processors | store-and-forward, E-50, E-79 | | strided addressing, B-31 | characteristics of, 115 | in switched-media networks, E-24 | | strip mining, F-17 to F-18, <b>F-17</b>, F-39 | development of, 16, K-21 to K-22, | technique of, E-50, E-52 | | striping, 362–364, <b>363</b> | K-25 to K-26 | | | virtual cut-through, E-51, E-73, E-92 | system area networks (SAN), E-3, E-4, E-72 to E-77, E-75 to | simultaneous multithreading in, 173–179, <b>174, 178</b> | |---|---|---| | wormhole, E-51, E-58, E-92 | E-77, E-100 to E-102. See | threads, 172, 199 | | syllables, G-35 | also interconnection | three-hop miss, H-31 | | symmetric shared-memory | networks | three-phased arbitration, E-49 | | multiprocessors (SMPs), | system calls, 316 | throttling, E-10, E-53 | | 205–218 | system-on-chip (SoC), D-3, D-19, | throughput. 
See also bandwidth; | | architecture of, 200, 200 | <b>D-20,</b> E-23, E-64 | effective bandwidth | | cache coherence protocols in, | | congestion management and, | | 205–208, <b>206</b> | T | E-54, E-65 | | coherence in, 205–208 | tag field, C-8, C-8 | defined, 15, 28, 372, E-13 | | commercial workload | tags | deterministic vs. adaptive routing | | performance in, 220–224, | function of, 289, C-8 | and, E-53 to E-54, <b>E-54</b> | | 221 to 226 | in Opteron data cache, C-12 to | I/O, 371 | | in large-scale multiprocessors, | C-13 | Thumb. See ARM Thumb | | H-45 | process-identifier, C-36, C-37 | Thunder Tiger4, E-20, E-44, E-56 | | limitations of, 216–217, <b>216</b> | in snooping protocols, 210–211 | TI 320C6x, D-8 to D-10, <b>D-9, D-10</b> | | scientific application performance | in SPARC architecture, J-30, <b>J-31</b> | TI 8847 chip, I-58, I-58, I-59, I-61 | | on, H-21 to H-26, H-23 to | tail duplication, G-21 | TI ASC, F-44, F-47 | | H-26 | tailgating, F-39 to F-40 | TI TMS320C55, D-6 to D-8, <b>D-6, D-7</b> | | shared vs. private data in, 205 | Takagi, N., I-65 | time division multiple access | | snooping protocol example, | Tandem disks, 368–369, <b>370</b> | (TDMA), D-25 | | 211–215, <b>213, 214, 215</b> | Tanenbaum, A. S., K-11 | time of flight, E-13, E-25 | | snooping protocol implementation | TB-80 cluster, <b>394,</b> 396–397 | time per instruction, in pipelining, A-3 | | in, 208–211, 217–218 | TCP/IP, E-81, E-83, <b>E-84</b>, E-95 | time-constrained scaling, 259, H-33 to | | symmetry, E-33, E-38 | TDMA (time division multiple | H-34 | | Synapse N + 1, K-39, <b>K-39</b> | access), D-25 | time-domain filtering, D-5 | | synchronization, 237–242 | telephone company failures, 371 | time-sharing, C-48 | | barrier, H-13 to H-16, H-14, | temporal locality, 38, 288, C-2 | time-to-live field, E-84 | | H-15, H-16 | Tera processor, K-26 | TLB. See translation lookaside buffers | | development of, K-40, K-44 | terminating events, A-41, A-42 | TLP. 
See thread-level parallelism | | hardware primitives, 238–240, | Tertiary Disk project, 368, <b>369</b>, 399, | Tomasulo's approach to dynamic | | H-18 to H-21, H-21 | 399 | scheduling, 92–104 | | implementing locks using | test-and-set synchronization primitive, | advantages of, 98, 104 | | coherence, 240–242, <b>242</b> | 239 | algorithm details, 100–101, <b>101</b> | | memory consistency and, | TFLOPS multiprocessor, K-37 to | basic processor structure in, 94, | | 244–245 | K-38 | 94 | | performance challenges in | Thinking Machines, K-35, K-61 | dynamic scheduling using, 92–97 | | large-scale multiprocessors, | Thinking Multiprocessor CM-5, K-40 | hardware-based speculation in, | | H-12 to H-16, <b>H-14, H-15,</b> | Thornton, J. E., K-10 | 105–114 | | H-16 | thread-level parallelism (TLP), | instruction steps in, 95 | | sense-reversing barriers, H-14 to | 172–179. See also | loop-based example, 102–104 | | H-15, <b>H-15</b> | multiprocessing; | multiple issue and speculation | | serialization in, H-16 | multithreading | example, 118–121, <b>120, 121</b> | | software implementations, H-17 | defined, 172 | register renaming in, 127–128 | | to H-18, <b>H-17</b> | instruction-level parallelism vs., | reorder buffer in, 106–114, <b>107</b>, | | synchronous DRAM (SDRAM), | 172 | 110, 111, 113 | | 313–314, 338, <b>338</b> | in MIMD computers, 197 | reservation stations in, 93, <b>94</b>, | | synchronous events, A-40, A-41, A-42 | processor comparison for, | 95–97, <b>99, 101,</b> 104 | | synchronous I/O, 391 | 179–181, <b>179, 180, 181</b> | software pipelining compared to, | | synonyms, 329, C-36 | processor limitations in, 181–183 | G-12 | | synthetic benchmarks, 29 | reasons for rise of, 262–264 | | | | | | | topology, E-29 to E-44 | translation buffers (TB). 
See | TX-2, K-26 | |---|---|---| | in centralized switched networks, | translation lookaside buffers | type fields, E-84 | | E-30 to E-34, <b>E-31</b>, <b>E-33</b> | translation lookaside buffers (TLB) | | | defined, E-21 | in AMD Opteron, 326–327, <b>327</b>, | U | | in distributed switched networks, | 328, C-55, C-55 | | | E-34 to E-39, <b>E-36</b>, <b>E-37</b>, | cache hierarchy and, 291, <b>292</b> | Ultracomputer, K-40 | | E-40 | development of, K-52 | Ultra-SPARC desktop computers, | | network performance and, E-40 to | in MIPS 64, K-52 | K-42 | | E-44, <b>E-44</b>, E-52 | misses and, C-45 | Ultrix operating system, C-37, E-69 | | torus | speculation and, 129 | UMA (uniform memory access), 200, | | in IBM Blue Gene/L, E-53 to | virtual memory and, 317, 320, | <b>200, 216,</b> 217 | | E-55, <b>E-54</b>, E-63, E-72 to | <b>323,</b> C-36, C-43 to C-45, | unbiased exponents, I-15 | | E-74 | C-45 | uncertainty, code, D-4 | | overview of, E-36 to E-38 | Transmission Control Protocol, E-81 | underflow, I-15, I-36 to I-37, I-62 | | performance and cost of, <b>E-40</b> | transmission speed, E-13 | unicasting, E-24 | | total ordering in, E-47 | transmission time, E-13 to E-14 | Unicode, B-14, B-34 | | total store ordering, 245 | transport latency, E-14 | unified caches, C-14, C-15 | | tournament predictors, 86–89, <b>160</b>, | trap handlers, I-34 to I-35, I-36, J-30 | uniform memory access (UMA), 200, | | 161, <b>162,</b> K-20 | trap instructions, A-42 | <b>200, 216,</b> 217. 
See also | | toy programs, 29 | tree height reduction, G-11 | symmetric shared-memory | | TP (transaction-processing) | tree-based barriers, H-18, <b>H-19</b> | multiprocessors | | benchmarks, 32–33, | trees | unit stride addressing, B-31 | | 374–375, <b>375</b> | binary tree multipliers, I-53 to | UNIVAC I, K-5 | | TPC (Transaction Processing | I-54 | unpacked numbers, I-16 | | Council), 32, 374–375, <b>375</b> | combining, H-18 | unpacking operations, B-14 | | TPC-A, 32 | fat, <b>E-33</b> , E-34, E-36, E-38, <b>E-40</b> , | up*/down* routing, E-48, E-67 | | TPC-App, 32 | E-48 | upgrade misses, H-35 | | TPC-C, 32, 46-47 | multiply, I-52 to I-53, <b>I-53</b> | upgrade requests, 219 | | TPC-H, 32 | octrees, H-9 | urgent pointer fields, E-84 | | TPC-W, 32 | signed-digit, I-53 to I-54 | use bits, C-43 to C-44 | | trace caches, 131, <b>132, 133,</b> 296, <b>309</b> | tree height reduction, G-11 | user maskable events, A-41, A-42 | | trace compaction, G-19 | tree-based barriers, H-18, <b>H-19</b> | user miss rates, 228, <b>228, 229</b> | | trace scheduling, G-19 to G-21, G-20 | Wallace, I-53 to I-54, <b>I-53</b> , I-63 | user nonmaskable events, A-41, A-42 | | trace selection, G-19 | Trellis codes, D-7 | user productivity, transaction time and, | | traffic intensity, 381 | trigonometric functions, I-31 to I-32 | 372–374, <b>373, 374</b> | | Transaction Processing Council | TRIPS Edge processor, E-63 | user-level communication, E-8 | | (TPC), 32, 374–375, <b>375</b> | Trojan horses, C-49, C-52 | user-requested events, A-40 to A-41, | | transaction time, 372 | true sharing misses, 218–219, 222, | A-42 | | transaction-processing benchmarks, | 224, 225 | | | 32–33, 374–375, <b>375</b> | true speedup, 258 | V | | transaction-processing servers, 46–48, | tunnel diode memory, K-53 | valid bits, C-8 | | 47, 48 | Turing, Alan, K-4 | value prediction, 130, 154-155, 170, | | transactions, steps in, 372, <b>373</b> | Turn Model, E-47 | K-25 | | transcendental functions, I-34, J-54 
| two-level predictors, 85 | variable-length encoding, 10, B-22 to | | transfers, instructions as, B-16 | two-phased arbitration, E-49 | B-23, <b>B-22</b> | | transient failures, E-66 | two's complement system, I-7 to I-10 | variables, register types and, B-5 | | transient faults, 367, 378–379 | two-way conflict misses, C-25 | variance, 383 | | transistors, performance scaling in, | two-way set associative blocks, C-7, | VAX, J-65 to J-83 | | 17–19 | C-8 | addressing modes in, J-67, J-70 to | | | | J-71 | | architecture summary, <b>J-42</b>, J-66 | registers, F-16 to F-18 | versions, E-84 | |---|---|---| | to J-68, <b>J-66</b> | vector loops, execution time of, F-35. | very long instruction word processor. | | CALLS instruction in, B-41 to | See also loop unrolling | See VLIW processors | | B-43 | vector processors, F-1 to F-51 | victim blocks, 301, 330 | | code size overemphasis in, B-45 | advantages of, F-2 to F-4 | victim buffers, 301, 330, C-14 | | condition codes in, J-71 | basic architecture of, F-4 to F-6, | victim caches, 301, K-54 | | conditional branch options in, | F-5, F-7, F-8 | virtual addresses, C-36, C-54 | | B-19 | chaining in, F-23 to F-25, <b>F-24</b>, | virtual caches, C-36 to C-38, C-37 | | data types in, <b>J-66</b> | F-35 | virtual channels | | encoding instructions in, J-68 to | characteristics of various, F-7 | head-of-line blocking and, E-59 | | J-70, <b>J-69</b> | conditionally executed statements | E-59 | | exceptions in, A-40, A-45 to A-46 | in, F-25 to F-26 | in IBM Blue Gene/L 3D Torus, | | frequency of instruction | Cray X1, 
F-40 to F-44, F-41, F-42 | E-73 | | distribution, J-82, J-82 | Earth Simulator, F-3 to F-4 | in InfiniBand, E-74 | | goals of, J-65 to J-66 | historical perspectives on, F-47 to | performance and, E-93 | | high-level language architecture | F-51 | routing and, E-47, E-53 to E-55, | | in, K-11 | instructions in, F-8 | E-54 | | historical floating point formats | load-store units in, F-6, <b>F-7</b>, F-13 | switching and, E-51, E-58, E-61, | | in, I-63 | to F-14 | E-73 | | memory addressing in, B-8, B-10 | memory systems in, F-14 to F-16, | virtual cut-through switching, E-51, | | operand specifiers in, J-67 to J-68, | F-15, F-22 to F-23, F-45 | E-73, E-92 | | J-68 | multiple lanes in, F-29 to F-31, | virtual functions, register indirect | | operations, J-70 to J-72, <b>J-71</b>, | F-29, F-30 | jumps for, B-18 | | J-73 | multi-streaming, F-43 | Virtual Machine Control State | | pipelining microinstruction | operation example, F-8 to F-10 | (VMCS), 340 | | execution in, A-46 | peak performance in, F-36, F-40 | virtual machine monitors (VMMs) | | sort procedure in, J-76 to J-79, | performance measures in, F-34 to | instruction set architectures and, | | J-76, J-80 | F-35, <b>F-35</b> | 319–320, 338–340, <b>340</b> | | | pipelined instruction start-up in, | Intel 80x86, 320, 321, 339, <b>340</b> | | swap procedure in, 
J-72 to J-76, <b>J-74, J-75,</b> J-79 | F-31 to F-32, <b>F-31</b> | overview of, 315, 318 | | VAX 11/780, 2, <b>3,</b> K-6 to K-7, K-11 | scalar performance and, F-44 to | page tables in, 320–321 | | VAX 8600, A-76 | F-45, <b>F-45</b> | requirements of, 318–319 | | VAX 8700 | sparse matrices in, F-26 to F-29 | Xen VMM example, 321–324, | | architecture of, K-13 to K-14, | sustained performance in, F-37 to | 322, 323 | | K-14 | F-38 | virtual machines (VM), 317–324 | | | vector execution time, F-10 to | defined, 317 | | MIPS M2000 compared with, | F-13, <b>F-13</b> | | | J-81, <b>J-82</b> | | impact on virtual memory and I/O, 320–321 | | pipelining cost-performance in, A-76 | vector stride in, F-21 to F-23 | | | VAX 8800, A-46 | vector-length control, F-16 to | instruction set architectures and, 319–320 | | | F-21, <b>F-17</b>, <b>F-19</b>, F-35 | | | vector architectures | vector-mask control, F-25 to F-26, | overview of, 317–318 | | advantages of, B-31 to B-32, F-47 | F-28 | Xen VMM example, 321–324, | | compiler effectiveness in, F-32 to | vector-mask registers, F-26 | 322, 323 | | F-34, <b>F-33, F-34</b> | vector-register processors | virtual memory, C-38 to C-55 | | in Cray X1, F-40 to F-41, <b>F-41</b> | characteristics of various, F-7 | in 64-bit Opteron, C-53 to C-55, | | in embedded systems, D-10 | components of, F-4 to F-6, <b>F-5</b>, | C-54, C-55 | | vector instructions, 68 | F-7, F-8 | address translations in, C-40, | | vector length | defined, F-4 | C-44 to C-47, <b>C-45</b>, <b>C-47</b> | | average, F-37 | vector-length control in, F-16 to | block replacement in, C-43 to | | control, F-16 to F-21, <b>F-17, F-19</b>, | F-21, <b>F-17</b>, <b>F-19</b> | C-44 | | | | | | F-35 optimization, F-37 | VelociTI 320C6x processors, D-8 to D-10, <b>D-9, D-10</b> | caches compared with, C-40, C-41 | | virtual memory (continued) | peak performance in, F-36 | WB. 
See write-back cycles | |---------------------------------------------------------|---------------------------------------------|----------------------------------------------| | defined, C-3 | processor characteristics in, F-7 | WCET (worst case execution time), | | development of, K-53 | sustained performance in, F-37 to | D-4 | | function of, C-39 | F-38 | weak ordering, 246, K-44 | | in IBM 370, J-84 | vector length control in, F-19 to | Web server benchmarks, 32-33, 377 | | impact of virtual machines on, | F-20 | Web sites | | 320–321 | vector stride in, F-22 | availability of, 400 | | in Intel Pentium, C-48, C-49 to | VMM. See virtual machine monitors | on multiple-issue processor | | C-52, <b>C-51</b> | voltage, adjustable, 18 | development, K-21 | | mapping to physical memory, | von Neumann, J., 287, I-62, K-2 to | for OpenMP consortium, H-5 | | C-39, <b>C-40</b> | K-3 | for SPEC benchmarks, 30 | | in memory hierarchy, C-40, C-41, | von Neumann computers, K-3 | for Transaction Processing | | C-42 to C-44, <b>C-43</b> | VPU processors, D-17 to D-18 | Council, 32 | | miss penalties in, C-40, C-42 | VS registers, F-6 | weighted arithmetic mean time, 383 | | in Opteron, C-53 to C-55, C-54, | VT-x, 339–340 | Weitek 3364 chip, I-58, I-58, I-60, | | C-55 | | I-61 | | page sizes in, C-45 to C-46 | W | West, N., I-65 | | paged vs. segmented, C-40 to | wafer yield, 23-24 | Whetstone synthetic program, K-6 | | C-42, <b>C-41, C-42</b> | wafers, costs of, 21-22, 23 | Whirlwind project, K-4 | | protection and, 315–317, | waiting line, 380. See also queuing | wide area networks (WAN), E-4, E-4, | | 324–325, C-39 | theory | <b>E-75,</b> E-79, E-97 to E-99. <i>See</i> | | relocation in, C-39, C-40 | Wall, D. 
W., 154, 169-170, K-25 | also interconnection | | size of, C-40 | Wallace trees, I-53 to I-54, I-53, I-63 | networks | | translation lookaside buffers and, | wall-clock time, 28 | Wilkes, Maurice, 310, B-1, K-3, K-52, | | 317, 320, <b>323</b> , C-36, C-43 to | WAN (wide area networks), E-4, E-4, | K-53 | | C-45, C <b>-45</b> | <b>E-75,</b> E-79, E-97 to E-99. <i>See</i> | Williams, T. E., I-52 | | virtual output queues (VOQ), E-60, | also interconnection | Wilson, R. P., 170 | | E-66 | networks | Winchester disk design, K-60 | | virtually indexed, physically tagged | Wang, WH., K-54 | window (instructions) | | optimization, 291–292, C-38, | WAR (write after read) hazards | effects of limited size of, | | C-46 | hardware-based speculation and, | 158–159, <b>159</b> , 166–167, <b>166</b> | | VLIW Multiflow compiler, 297 | 112 | defined, 158 | | VLIW processors, 114–118. See also | as ILP limitations, 72, 169 | limitations on size of, 158 | | Intel IA-64 | in pipelines, 90 | in scoreboarding, A-74 | | characteristics of, 114–115, 115 | in scoreboarding, A-67, A-69 to | in TCP, E-84 | | in embedded systems, D-8 to | A-70, A-72, <b>A-75</b> | windowing, E-65 | | D-10, <b>D-9, D-10</b> EDIC approach in G-23 | Tomasulo's approach and, 92, 98 | wireless networks, D-21 to D-22, <b>D-21</b> | | EPIC approach in, G-33 historical perspectives on, K-21 | wavelength division multiplexing | within vs. between instructions, A-41, A-42 | | overview of, 115–118, 117 | (WDM), E-98 | Wolfe, M., F-51 | | VLVCU (load vector count and | WAW (write after write) hazards | word count field, C-51, C-52 | | update), F-18 | in floating-point pipelines, A-50, | word operands, B-13 | | VM. 
See virtual machines | A-52 to A-53 | working set effect, H-24 | | VME racks, 393, <b>394</b> | hardware-based speculation and,<br>112 | workloads, execution time of, 29 | | VMIPS | as ILP limitations, 71, 169 | World Wide Web, 6, E-98 | | architecture of, F-4 to F-6, <b>F-5</b> , | in pipelines, 90 | wormhole switching, E-51, E-58, | | F-7, F-8 | in scoreboarding, A-67, A-69, | E-88, E-92 to E-93 | | memory pipelines on, F-38 to | A-75 to A-76 | worst case execution time (WCET), | | F-40 | Tomasulo's approach and, 92, | D-4 | | multiple lanes in, F-29 to F-31, | 98–99 | write allocate, C-11 to C-12 | | F-29, F-30 | way prediction, 295, <b>309</b> | write back, in virtual memory, C-44 | | operation example, F-8 to F-10 | Wayback Machine, 393 | write buffers | | - · | . ,, 0/2 | | | defined, C-11 | write-through caches | |-----------------------------------------|------------------------------------------| | function of, 289, 291 | advantages and disadvantages of, | | merging, 300–301, <b>301, 309</b> | C-11 to C-12 | | read misses and, 291, C-34 to | defined, C-10 | | C-35 | invalidate protocols and, 210, 211, | | in snooping protocols, 210 | 212 | | write invalidate protocols | I/O coherency and, 326 | | in directory-based cache | write buffers and, C-35 | | coherence protocols, 233, 234 | Wu, Chuan-Lin, E-1 | | example of, 212, 213, 214 | | | implementation of, 209–211 | X | | overview, 208–209, <b>209</b> | X1 nodes, F-42, <b>F-42</b> | | write merging, 300–301, <b>301, 309</b> | Xen VMM, 321–324, <b>322, 323</b> | | write miss | Xeon-MP, 198 | | directory protocols and, 231, 233, | XIE, F-7 | | 234–237, <b>235, 236</b> | XIMD architecture, K-27 | | in large-scale multiprocessors, | Xon/Xoff flow control, E-10 | | H-35, H-39 to H-40 | Aon Aon now condoi, E 10 | | sequential consistency and, 244 | V | | in snooping protocols, 212–214, | Υ | | 213, 214 | Yajima, S., I-65 | | in spinning, 241, <b>242</b> | Yamamoto, W., K-27 | | write allocate vs. 
no-write | Yasuura, H., I-65 | | allocate, C-11 to C-12 | yields, 19–20, <b>20,</b> 22–24 | | write result stage of pipeline, 96, | arge-Scale Multipr | | 100–101, <b>103,</b> 108, 112 | 2 | | write serialization, 206–207 | omputer Arithmeti oraș | | write speed, C-9 to C-10 | finding zero iteration, I-27 to I-29, | | write stalls, C-11 | I-28 | | write update (broadcast) protocol, 209, | in floating-point multiplication, | | 217 | I-21 | | write-back caches | shifting over, I-45 to I-47, <b>I-46</b> | | advantages and disadvantages of, | signed, I-62 | | C-10 to C-12 | zero-copy protocols, E-8, E-91 | | cache coherence and, H-36 | zero-load, E-14, E-25, E-52, E-53, | | consistency in, 289 | E-92 | | defined, C-10 | zSeries, F-49 | | directory protocols and, 235, 236, | Zuse, Konrad, K-4 | | 237 | | | invalidate protocols and, 210, | | | 211–212, <b>213, 214</b> | | | in Opteron microprocessor, C-14 | | | reducing cost of writes in, C-35 | | | write-back cycles (WB) | | | in floating-point pipelining, A-51, | | | A-52 | | | in RISC instruction set, A-6 | | | in unpipelined MIPS | | | implementation, A-28, A-29 | | | writes to disks 364 | | #### **About the CD** The CD that accompanies this book includes: - Reference appendices. These appendices—some guest authored by subject experts—cover a range of topics, including specific architectures, embedded systems, and application-specific processors. - Historical Perspectives and References. Appendix K includes several sections exploring the key ideas presented in each of the chapters in this text. References for further reading are also provided. - Search engine. A search engine is included, making it possible to search for content in both the printed text and the CD-based appendices. 
#### Appendices on the CD

- Appendix D: Embedded Systems
- Appendix E: Interconnection Networks
- Appendix F: Vector Processors
- Appendix G: Hardware and Software for VLIW and EPIC
- Appendix H: Large-Scale Multiprocessors and Scientific Applications
- Appendix I: Computer Arithmetic
- Appendix J: Survey of Instruction Set Architectures
- Appendix K: Historical Perspectives and References